Colab Notebook is here: https://colab.research.google.com/drive/1Iy2WjvmrbCeLUilDsSot017f1lGufdTv
1. Introduction to Python¶
Welcome to the first lesson of the Legal Data Analytics course. Let's start by understanding some key terms:
- The Code: This is the text you'll write, either in the Console when coding on your computer or within the cells of this Colab Notebook.
- The Console: The interactive environment where you input code to produce results. Variables and data persist as long as you don't restart the console or leave a Colab page. Remember to regularly save your data and code, as it won't stay in memory indefinitely.
- Comments: These are non-executable notes within your code, used to describe or explain what's happening. Comments in Python start with the hash sign #.
- Errors: These occur when you try to run invalid code. Common errors include:
- Syntax Error: Caused by incorrect code structure, vital to avoid in Python.
- Type Error: Results from trying to combine incompatible data types.
- Colab: A cloud-based coding platform powered by Google. Unlike local coding, which uses your computer's resources, Colab runs on external servers. It's ideal for tasks requiring high computational power, but it has certain restrictions on data processing and package compatibility. For example, Colab might not be the best choice for web scraping tasks.
1. Computer code, at its most basic, calculates stuff. You can think of this course and everything that follows as expanding the uses of a calculator. For instance, if you input 2+2 in the Console, press "Enter" (or CTRL+Enter in Colab, or just click the left-hand side button), output will be 4.
2+2
# This is a very simple computation
4
Typically, in Python you'd use the command print to have stuff appear on your screen. It is also more precise, as it gives your computer the exact command to process: if you type two lines of computation before pressing enter, only the last will render; however, both will render an output if you specify that both need to be printed.
2+2
3+3
6
print is a command. Like most commands (or functions), it requires some arguments, which are given within brackets: here, 2+2.
print(2+2)
print(2*3)
4 6
2. We will come back to functions a bit later. Before that, we need to discuss variables, which you can think of as containers in which you store information.
Variable Types¶
Variables are typically written in lowercase; you create a variable, or assign data to it, with the = sign, following the syntax variable = value.
You can then use variables directly in functions (such as print), or do operations between them.
You can assign and re-assign variables at will: you can even assign a variable to another variable.
alpha = 1
beta = 2
gamma = 2 * 3
print(gamma)
print(alpha + beta + gamma)
6 9
Variables that contain numbers can also be added to or subtracted from with a specific syntax: var += 2 means that 2 will be added to the variable, every time you run this particular command. Think of it as an update of the original variable.
gamma = gamma + alpha
print(gamma)
gamma += 1
print(gamma)
gamma -= 3
print(gamma)
7 8 5
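As a small sketch of the same idea, the snippet below updates a hypothetical variable named total (not part of the notebook) in place. Each augmented operator is shorthand for the longer form given in the comment.

```python
total = 10      # hypothetical starting value
total += 5      # same as total = total + 5
total -= 3      # same as total = total - 3
total *= 2      # same as total = total * 2
print(total)    # 24
```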
Strings¶
Variables need not be numbers. They can also be text, which in Python is known as a string. Likewise, you can perform operations on them, such as concatenating two strings.
alpha = "Hello World"
beta = 'Hello Cake'
var = " "
print(alpha + var + beta)
Hello World Hello Cake
Do note that the print command does exactly what you ask it to do: it does not insert a space between the two strings by itself (here, the space comes from the variable var); it is for you to think of this kind of detail. Programming is deterministic: output follows input with, most of the time, no role for randomness. On the plus side, this means you can be assured that you'll get the expected output if you type proper input; on the minus side, this requires utmost precision on your part.
It is also important to keep in mind that strings are different from numbers. You cannot, for instance, add a string to a number: this would throw a TypeError.
delta = "5"
gamma = -3
print(str(gamma) + delta) # str() turns the number gamma into the string '-3', so that it can be joined to delta
-35
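To see the TypeError mentioned above without stopping the notebook, here is a minimal sketch. It uses try/except, which this lesson does not cover, purely to catch and display the error; the variable names number and text are illustrative.

```python
number = 3
text = "5"

try:
    print(number + text)        # an int and a str cannot be added together
except TypeError as error:
    print("TypeError:", error)  # the error is caught and its message printed

print(str(number) + text)       # both are strings: concatenation gives 35
print(number + int(text))       # both are numbers: addition gives 8
```

Which conversion you pick depends on what you want: text glued together, or an actual sum.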
In what follows, we'll use text and strings taken from Mervyn Peake's poem The Frivolous Cake. I have numbered every line; we'll store the poem in a variable for now and come back to it later.
poem = """The Frivolous Cake
1.1 A freckled and frivolous cake there was
1.1 That sailed upon a pointless sea,
1.2 Or any lugubrious lake there was
1.3 In a manner emphatic and free.
1.4 How jointlessly, and how jointlessly
1.5 The frivolous cake sailed by
1.6 On the waves of the ocean that pointlessly
1.7 Threw fish to the lilac sky.
2.1 Oh, plenty and plenty of hake there was
2.1 Of a glory beyond compare,
2.2 And every conceivable make there was
2.3 Was tossed through the lilac air.
3.1 Up the smooth billows and over the crests
3.1 Of the cumbersome combers flew
3.2 The frivolous cake with a knife in the wake
3.3 Of herself and her curranty crew.
3.4 Like a swordfish grim it would bounce and skim
3.5 (This dinner knife fierce and blue) ,
3.6 And the frivolous cake was filled to the brim
3.7 With the fun of her curranty crew.
4.1 Oh, plenty and plenty of hake there was
4.1 Of a glory beyond compare -
4.2 And every conceivable make there was
4.3 Was tossed through the lilac air.
5.1 Around the shores of the Elegant Isles
5.1 Where the cat-fish bask and purr
5.2 And lick their paws with adhesive smiles
5.3 And wriggle their fins of fur,
5.4 They fly and fly neath the lilac sky -
5.5 The frivolous cake, and the knife
5.6 Who winketh his glamorous indigo eye
5.7 In the wake of his future wife.
6.1 The crumbs blow free down the pointless sea
6.1 To the beat of a cakey heart
6.2 And the sensitive steel of the knife can feel
6.3 That love is a race apart
6.4 In the speed of the lingering light are blown
6.5 The crumbs to the hake above,
6.6 And the tropical air vibrates to the drone
6.7 Of a cake in the throes of love."""
print(poem)
The Frivolous Cake 1.1 A freckled and frivolous cake there was 1.1 That sailed upon a pointless sea, 1.2 Or any lugubrious lake there was 1.3 In a manner emphatic and free. 1.4 How jointlessly, and how jointlessly 1.5 The frivolous cake sailed by 1.6 On the waves of the ocean that pointlessly 1.7 Threw fish to the lilac sky. 2.1 Oh, plenty and plenty of hake there was 2.1 Of a glory beyond compare, 2.2 And every conceivable make there was 2.3 Was tossed through the lilac air. 3.1 Up the smooth billows and over the crests 3.1 Of the cumbersome combers flew 3.2 The frivolous cake with a knife in the wake 3.3 Of herself and her curranty crew. 3.4 Like a swordfish grim it would bounce and skim 3.5 (This dinner knife fierce and blue) , 3.6 And the frivolous cake was filled to the brim 3.7 With the fun of her curranty crew. 4.1 Oh, plenty and plenty of hake there was 4.1 Of a glory beyond compare - 4.2 And every conceivable make there was 4.3 Was tossed through the lilac air. 5.1 Around the shores of the Elegant Isles 5.1 Where the cat-fish bask and purr 5.2 And lick their paws with adhesive smiles 5.3 And wriggle their fins of fur, 5.4 They fly and fly neath the lilac sky - 5.5 The frivolous cake, and the knife 5.6 Who winketh his glamorous indigo eye 5.7 In the wake of his future wife. 6.1 The crumbs blow free down the pointless sea 6.1 To the beat of a cakey heart 6.2 And the sensitive steel of the knife can feel 6.3 That love is a race apart 6.4 In the speed of the lingering light are blown 6.5 The crumbs to the hake above, 6.6 And the tropical air vibrates to the drone 6.7 Of a cake in the throes of love.
var = poem[1199:1200]
var.encode("utf8").decode()
Lists¶
Another type of variable is a list, which is exactly what you think it is: it lists things, such as data, or even other variables, or even other lists! Lists are written with square brackets and commas. You update a list by calling the method .append() directly on the list, as follows; note that the item you added with append now appears at the end of the list.
beta = "Cake"
my_list = ["Frivolous", 42, beta, ["This is a second list, with two items", 142], "Peake"]
print(my_list)
my_list.append("Swordfish")
print(my_list)
my_list.append("Swordfish")
print(my_list)
['Frivolous', 42, 'Cake', ['This is a second list, with two items', 142], 'Peake'] ['Frivolous', 42, 'Cake', ['This is a second list, with two items', 142], 'Peake', 'Swordfish'] ['Frivolous', 42, 'Cake', ['This is a second list, with two items', 142], 'Peake', 'Swordfish', 'Swordfish']
A very important feature of lists is that they are ordered. This means if you know the (numerical) index of an item in a list, you can access it immediately. This is called "indexing".
Learn it once and for all: in Python, indexes start at 0; the first element in a list can be found at index 0. This is not intuitive, but you need to get used to it: 0, not 1, marks the beginning of a list.
print(my_list[0])
print(my_list[1])
print(my_list[4][1])
print(my_list[3][1] + my_list[1])
Frivolous 42 e 184
Note that my_list[3] returns the second list that was in my_list. As such, it can also itself be indexed. The same goes for strings: my_list[4] is 'Peake', so my_list[4][1] is its second letter, 'e'.
Indexing also works using the relative position of an item in a list: [-1] gives you the last item, [-2] the penultimate, etc.
print(my_list[-1]) # Will return 'Swordfish', the last item; 'Peake' is now the antepenultimate item, since we appended 'Swordfish' twice
Swordfish
print(my_list[-0])
Frivolous
More importantly, you can select what's called a range by using the : operator. The operator is not inclusive of the outer limit, meaning that the item on the right-hand side of the : operator won't be included in the list that is rendered. For instance, if you look for indexes [0:2], you'll get items at index 0 and index 1, but not 2 (because it's excluded).
You can leave the selection open-ended, according to the same principles: the right-hand-side index won't be included, but the left-hand-side one is. So [:5] means "any element until the 6th (not included)", while [2:] means "every element from the third element (included) onwards".
print(my_list)
['Frivolous', 42, 'Cake', ['This is a second list, with two items', 142], 'Peake', 'Swordfish', 'Swordfish']
print(my_list[1:3])
print(my_list[2:])
print(my_list[-2:])
print(my_list[0::3]) # This last type of range gives you every third item, starting from index 0
[42, 'Cake'] ['Cake', ['This is a second list, with two items', 142], 'Peake', 'Swordfish', 'Swordfish'] ['Swordfish', 'Swordfish'] ['Frivolous', ['This is a second list, with two items', 142], 'Swordfish']
print(my_list[:2][0][0])
F
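The slicing rules above can be checked on a shorter list. This is a minimal sketch with a hypothetical list called letters; the same syntax also works on strings.

```python
letters = ["a", "b", "c", "d", "e", "f"]   # hypothetical list, indexes 0 to 5

first_two = letters[:2]     # indexes 0 and 1; index 2 is excluded
middle = letters[1:4]       # indexes 1, 2 and 3
last_two = letters[-2:]     # the last two items
every_other = letters[::2]  # indexes 0, 2 and 4

print(first_two)            # ['a', 'b']
print(middle)               # ['b', 'c', 'd']
print(last_two)             # ['e', 'f']
print(every_other)          # ['a', 'c', 'e']
print("Peake"[0:2])         # Pe
```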
Booleans¶
Another data type is what's called a boolean. It is simply a value that is either True or False, and it is very useful when you have to check conditions. It's based on the logic invented by George Boole in the mid-1800s, which is basically what powers computers now (see here), the bunch of 0s and 1s that signify computer data.
var_bol = True
var_bol2 = False
print(bool(var_bol))
print(bool(var_bol2))
True False
Boolean logic works by manipulating True and False statements. In Python, you often need to check if something is True or not, for instance in the context of conditions (next module). The most basic way to do this is with the == operator (double equal - not to be confused with the single equal, which is used to assign a variable). Its opposite is !=.
gamma = 5
print(gamma == 5) # Since gamma is indeed 5, this prints True
print(gamma != 5) # Since gamma is not different from 5, this prints False
True False
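A comparison produces a boolean that you can store in a variable like any other value. The sketch below uses hypothetical names (age, is_adult) and the keyword not, which flips a boolean and is not otherwise covered in this section.

```python
age = 17                  # hypothetical value
is_adult = age >= 18      # the comparison itself evaluates to True or False

print(is_adult)           # False
print(type(is_adult))     # <class 'bool'>
print(not is_adult)       # True: `not` flips a boolean
```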
This sounds basic, but Booleans are really at the basis of everything. The whole idea of smart contracts, for instance, is premised on the concept of assigning True or False to various contract terms and performances. Booleans will be particularly helpful when we get to conditions, in the Syntax module.
Sets and Dictionaries¶
Finally, there are two other types of data worth knowing at this stage: sets and dictionaries.
Sets are like lists (they can hold most sorts of values, though not a list), except that they are unordered and can't contain duplicates. They are very useful to check if two sets of data overlap, or what they have or don't have in common. Since they are not ordered, you cannot select an element from a set by its index. If you create a set with a duplicate element, Python will ignore the duplicate and return a set without it.
my_set = {1, 2, 2, 3, 3, 4, "Cake", "Cake", "Knife"}
print(my_set)
{1, 2, 3, 4, 'Knife', 'Cake'}
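A common use of sets is removing duplicates from a list and then testing membership. The following is a small sketch with made-up names; the list witnesses is hypothetical.

```python
witnesses = ["Alice", "Bob", "Alice", "Carol", "Bob"]   # hypothetical list with duplicates
unique_witnesses = set(witnesses)                       # duplicates collapse into one entry

print(len(witnesses), len(unique_witnesses))   # 5 3
print("Carol" in unique_witnesses)             # True
print("Dave" in unique_witnesses)              # False
```

Because a set is unordered, the order in which its items are printed may differ from the order in the original list.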
A dictionary is a type of data that links a key to a value. The key becomes the index of your dictionary; if you give a key to the dictionary, it will return the value. It is useful to track down relations between different data points. Here as well, you can use nearly any type of data you want. You use curly brackets, and indicate the relationship with a ":" operator.
my_dict = {42: "Mervyn", "Peake": 2, "My List" : my_list}
print(my_dict[42])
print(my_dict["My List"])
Mervyn ['Frivolous', 42, 'Cake', ['This is a second list, with two items', 142], 'Peake', 'Swordfish', 'Swordfish']
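Dictionaries can also be updated after they are created. The sketch below, built on a hypothetical dictionary named case_years, adds a key and checks whether a key exists.

```python
case_years = {"Donoghue v Stevenson": 1932}   # hypothetical dictionary: case name -> year
case_years["Marbury v Madison"] = 1803        # assigning to a new key adds a key/value pair

print(case_years["Marbury v Madison"])        # 1803
print("Roe v Wade" in case_years)             # False: `in` looks at the keys
print(len(case_years))                        # 2
```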
Exercise¶
Before switching to the next section, find a way to print "Mervyn Peake" by indexing both the list my_list and the dictionary my_dict.
my_list = ["Frivolous", 42, ["This is a second list, with two items", 142], "Peake"]
my_dict = {42: "Mervyn", "Peake": 2, "My List" : my_list}
# Your code
Functions¶
Now, coming back to functions: they are what allows you to perform operations on data and variables in Python (more info here). To use a function, you need to pass it the arguments it expects.
Many functions are native to Python, meaning you need neither create them yourself nor import them from an existing library. Among these native functions are those that allow you to play with the types of variables. For instance, str transforms your variable into a string, int into a whole number, list into a list, etc. The function set takes a list and returns a set, while the function len tells you how long a variable is.
sent = "How long is this sentence ? :"
print(sent, len(sent))
print(list(sent))
print(round(2.3))
print(set([1, 1, 1, 3]))
print(str(3) + "2")
How long is this sentence ? : 29 ['H', 'o', 'w', ' ', 'l', 'o', 'n', 'g', ' ', 'i', 's', ' ', 't', 'h', 'i', 's', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', ' ', '?', ' ', ':'] 2 {1, 3} 32
If you are not sure what a function does, you can always type help(function), and the console will return an answer.
You can create your own functions with the keyword def, then give the function a name, and specify the expected arguments within brackets.
Then, and this is crucial, you input a new line, and a tab - the inside of your function should not be on the same line as the declaration, everything should be shifted by one tab. (Indentation is a key syntax method we'll see again and again.)
The example given is a very simple function returning a sum, and while you'd get the same result merely by doing the sum immediately (without passing it to a function), it's sometimes useful to write down things formally in this fashion.
def my_function(alpha, beta): # An example of a function that just returns the sum of the two arguments you pass to it
    # (which will be known as alpha and beta in the sole context of the function)
    return beta + alpha
print(my_function(1, 2)) # We call the function with brackets to include the expected arguments
print(my_function("E", "H")) # Notice that the order of the arguments is important
3 HE
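To illustrate why the order of arguments matters, here is a minimal sketch with a hypothetical function full_name. Naming the arguments in the call is an alternative that makes the order irrelevant.

```python
def full_name(first, last):
    # Join the two arguments with a space in between
    return first + " " + last

print(full_name("Mervyn", "Peake"))              # Mervyn Peake
print(full_name("Peake", "Mervyn"))              # Peake Mervyn: positional order matters
print(full_name(last="Peake", first="Mervyn"))   # Mervyn Peake: named arguments can come in any order
```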
However, Python already has plenty of built-in functions, so that you don't have to reinvent the wheel every time you need to do something. As you already saw, you don't need a function to calculate the sum of two variables, so my_function is redundant.
Python is a shared resource, and for most uses someone has already created a function for you, such that you just need to import it. This is the system of packages and libraries that powers Python.
When you find a package or module that interests you, you should first install it (from the internet) in your local Python environment. The way to do this is with the command pip install X, with X being the name of your package; you type this command in the Terminal (not the Console), unless you have downloaded the iPython package (which boosts the console and allows direct installations). Once a package is installed, you don't have to do it again. (On your computer, that is. On Colab, you need to redo it every time, since you are only renting temporary computing capacity on the cloud.)
!pip install llm
Collecting llm Downloading llm-0.19.1-py3-none-any.whl.metadata (6.5 kB) Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from llm) (8.1.7) Requirement already satisfied: openai>=1.0 in /usr/local/lib/python3.10/dist-packages (from llm) (1.57.4) Collecting click-default-group>=1.2.3 (from llm) Downloading click_default_group-1.2.4-py2.py3-none-any.whl.metadata (2.8 kB) Collecting sqlite-utils>=3.37 (from llm) Downloading sqlite_utils-3.38-py3-none-any.whl.metadata (7.5 kB) Collecting sqlite-migrate>=0.1a2 (from llm) Downloading sqlite_migrate-0.1b0-py3-none-any.whl.metadata (5.4 kB) Requirement already satisfied: pydantic>=1.10.2 in /usr/local/lib/python3.10/dist-packages (from llm) (2.10.3) Requirement already satisfied: PyYAML in /usr/local/lib/python3.10/dist-packages (from llm) (6.0.2) Requirement already satisfied: pluggy in /usr/local/lib/python3.10/dist-packages (from llm) (1.5.0) Collecting python-ulid (from llm) Downloading python_ulid-3.0.0-py3-none-any.whl.metadata (5.8 kB) Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from llm) (75.1.0) Requirement already satisfied: pip in /usr/local/lib/python3.10/dist-packages (from llm) (24.1.2) Collecting puremagic (from llm) Downloading puremagic-1.28-py3-none-any.whl.metadata (5.8 kB) Requirement already satisfied: anyio<5,>=3.5.0 in /usr/local/lib/python3.10/dist-packages (from openai>=1.0->llm) (3.7.1) Requirement already satisfied: distro<2,>=1.7.0 in /usr/local/lib/python3.10/dist-packages (from openai>=1.0->llm) (1.9.0) Requirement already satisfied: httpx<1,>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from openai>=1.0->llm) (0.28.1) Requirement already satisfied: jiter<1,>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from openai>=1.0->llm) (0.8.2) Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from openai>=1.0->llm) (1.3.1) Requirement already satisfied: tqdm>4 in 
/usr/local/lib/python3.10/dist-packages (from openai>=1.0->llm) (4.67.1) Requirement already satisfied: typing-extensions<5,>=4.11 in /usr/local/lib/python3.10/dist-packages (from openai>=1.0->llm) (4.12.2) Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.10/dist-packages (from pydantic>=1.10.2->llm) (0.7.0) Requirement already satisfied: pydantic-core==2.27.1 in /usr/local/lib/python3.10/dist-packages (from pydantic>=1.10.2->llm) (2.27.1) Collecting sqlite-fts4 (from sqlite-utils>=3.37->llm) Downloading sqlite_fts4-1.0.3-py3-none-any.whl.metadata (6.6 kB) Requirement already satisfied: tabulate in /usr/local/lib/python3.10/dist-packages (from sqlite-utils>=3.37->llm) (0.9.0) Requirement already satisfied: python-dateutil in /usr/local/lib/python3.10/dist-packages (from sqlite-utils>=3.37->llm) (2.8.2) Requirement already satisfied: idna>=2.8 in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.5.0->openai>=1.0->llm) (3.10) Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.5.0->openai>=1.0->llm) (1.2.2) Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx<1,>=0.23.0->openai>=1.0->llm) (2024.12.14) Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx<1,>=0.23.0->openai>=1.0->llm) (1.0.7) Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.10/dist-packages (from httpcore==1.*->httpx<1,>=0.23.0->openai>=1.0->llm) (0.14.0) Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil->sqlite-utils>=3.37->llm) (1.17.0) Downloading llm-0.19.1-py3-none-any.whl (44 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.5/44.5 kB 1.7 MB/s eta 0:00:00 Downloading click_default_group-1.2.4-py2.py3-none-any.whl (4.1 kB) Downloading sqlite_migrate-0.1b0-py3-none-any.whl (10.0 kB) Downloading sqlite_utils-3.38-py3-none-any.whl (68 kB) 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68.2/68.2 kB 3.0 MB/s eta 0:00:00 Downloading puremagic-1.28-py3-none-any.whl (43 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.2/43.2 kB 2.5 MB/s eta 0:00:00 Downloading python_ulid-3.0.0-py3-none-any.whl (11 kB) Downloading sqlite_fts4-1.0.3-py3-none-any.whl (10.0 kB) Installing collected packages: sqlite-fts4, puremagic, python-ulid, click-default-group, sqlite-utils, sqlite-migrate, llm Successfully installed click-default-group-1.2.4 llm-0.19.1 puremagic-1.28 python-ulid-3.0.0 sqlite-fts4-1.0.3 sqlite-migrate-0.1b0 sqlite-utils-3.38
import llm
Note that the "pip install X" is not Python code; it's a command line function, which is different.
Now, in order to use the packages you downloaded, you need to import them every time you restart the console. Every script file starts with "import statements", which indicate which packages you will be using in the context of this script. You need the keyword import, and you can give aliases to the modules with the keyword as.
If you get an error when trying to import a package, usually it's because you have not installed it (with "pip install" as described above).
You can either import a full package (such as numpy here), or dedicated functions or "modules" in a package (using the "from X import Y" syntax). The difference is mainly in how you then refer to what you imported (and, for some heavy packages, importing only a sub-module can be lighter). You then call the functions in accordance with their name as imported, a name that you can set yourself (some are conventional: pandas is nearly always pd, numpy is np, etc.)
import numpy as np # We import numpy, but it's typically aliased 'np'
from collections import Counter # We can also import only selected functions or classes from a package
import pandas as pd
The first package here, numpy, is specialised in numbers and mathematical operations; it typically goes further than the basic Python functions. For instance, if you want to compute a mean, you could just use sum and divide by len; but it's easier to just use np.mean() on a list of numbers.
The syntax is always the same: you go from a module to a function by adding a period (.), and then add the required parameters to your function. If you don't know what the required parameters are, you can usually use CTRL+P to learn more about the inside of a function.
np.mean([1, 5, 10, 15]) # Instead of creating a function to calculate a mean, we can just leverage the existing package numpy
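For comparison, the "manual" mean described above can be sketched with the built-in sum and len. The second half shows one possible use of Counter, which was imported earlier but not demonstrated; the list of words is made up for illustration.

```python
from collections import Counter

numbers = [1, 5, 10, 15]
manual_mean = sum(numbers) / len(numbers)   # total divided by the number of items
print(manual_mean)                          # 7.75

# Counter tallies how many times each item appears in a list (hypothetical word list)
word_counts = Counter(["cake", "knife", "cake", "hake", "cake"])
print(word_counts["cake"])                  # 3
print(word_counts.most_common(1))           # [('cake', 3)]
```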
Methods¶
All of the data and variables you will manipulate in this course have built-in functions (called "methods") already attached to them. They also have what's called attributes, which are data points. Methods are called with brackets (which contain any arguments), while attributes have no brackets (and only output the data point).
(We won't learn it, but it's related to classes in Python, about which you can learn more here).
We saw it earlier when we used append to add an item to a list. The equivalent method for sets is add.
We will be working on legal data, which means, for a large part, text data. Fortunately, strings are quite easy to work with, as they have built-in methods in Python. For instance, the method split can be used to obtain a list of items in that string depending on a splitting criterion. The opposite of that method would be join, whereby you join items in a list with a common character.
These two functions are also a good example of how Python works between different data types: lists become strings, and the reverse.
splitted_words = "A freckled and frivolous cake there was".split(" ")
# The variable splitted_words will take the result of the right-hand side expression, which splits a string according to a criterion
splitted_cake = "A freckled and frivolous cake there was".split("cake")
print(splitted_words)
print(splitted_cake)
print("".join(splitted_words))
['A', 'freckled', 'and', 'frivolous', 'cake', 'there', 'was'] ['A freckled and frivolous ', ' there was'] Afreckledandfrivolouscaketherewas
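Since split and join are opposites, splitting on a character and joining with that same character gives back the original string. A short sketch, reusing the first line of the poem (the variable names verse and pieces are illustrative):

```python
verse = "A freckled and frivolous cake there was"
pieces = verse.split(" ")           # string -> list of words

print(len(pieces))                  # 7
print(" ".join(pieces) == verse)    # True: the round trip restores the original
print("-".join(pieces))             # A-freckled-and-frivolous-cake-there-was
```

Note that join is called on the separator string, not on the list.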
Sets have a number of methods attached to them as well, allowing for comparisons between sets:
- difference will return the items of set1 that are not in set2;
- intersection, for items that are in both sets;
- union returns a set with both sets' content; and
- symmetric_difference returns the items that are in only one of the two sets.
set1 = {1,2,3,4}
set2 = {3,4,5,6}
print(set1.difference(set2))
print(set1.intersection(set2))
print(set1.union(set2))
print(set1.symmetric_difference(set2))
{1, 2} {3, 4} {1, 2, 3, 4, 5, 6} {1, 2, 5, 6}
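Python also offers operator shorthands for these four methods. The sketch below shows them on the same two sets; note that difference is the only one where the order of the sets changes the result.

```python
set1 = {1, 2, 3, 4}
set2 = {3, 4, 5, 6}

print(set1 - set2)   # {1, 2}: same as set1.difference(set2)
print(set2 - set1)   # {5, 6}: swapping the sets changes the difference
print(set1 & set2)   # {3, 4}: same as set1.intersection(set2)
print(set1 | set2)   # same as set1.union(set2)
print(set1 ^ set2)   # same as set1.symmetric_difference(set2)
```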
Exercises¶
- Find the two sentences from the sentences list that have the most letters in common (each letter counted only once).
sentences = ["1.1 A freckled and frivolous cake there was", "1.1 That sailed upon a pointless sea, ", "1.2 Or any lugubrious lake there was", "1.3 In a manner emphatic and free."]
# Your code here
seta = set(list(sentences[0]))
setb = set(list(sentences[1]))
setc = set(list(sentences[2]))
setd = set(list(sentences[3]))
print(seta.intersection(setb))
print(seta.intersection(setc))
print(seta.intersection(setd))
print(setb.intersection(setc))
print(setb.intersection(setd))
print(setc.intersection(setd))
{'t', 'o', '.', 'd', 'l', '1', ' ', 's', 'i', 'u', 'h', 'n', 'a', 'e'} {'t', '.', 'o', 'r', 'l', ' ', '1', 's', 'i', 'u', 'w', 'h', 'n', 'a', 'k', 'e'} {'t', '.', 'd', 'r', 'f', ' ', '1', 'i', 'c', 'h', 'n', 'a', 'e'} {'t', 'o', '.', 'l', '1', ' ', 's', 'i', 'u', 'h', 'n', 'a', 'e'} {'t', '.', 'd', '1', ' ', 'i', 'p', 'h', 'n', 'a', 'e'} {'t', '.', 'r', '1', ' ', 'i', 'h', 'n', 'a', 'e'}
- Find the number of words that are common to two stanzas of the poem. There are (at least) three steps to do so.
a = 'freckled and frivolous cake there was\nThat sailed upon a pointless sea, \nOr any lugubrious lake there was\nIn a manner emphatic and free.\nHow jointlessly, and how jointlessly\nThe frivolous cake sailed by\nOn the waves of the ocean that pointlessly\nThrew fish to the lilac sky.'
b = 'Around the shores of the Elegant Isles\nWhere the cat-fish bask and purr\nAnd lick their paws with adhesive smiles\nAnd wriggle their fins of fur, \nThey fly and fly neath the lilac sky -\nThe frivolous cake, and the knife\nWho winketh his glamorous indigo eye\nIn the wake of his future wife.'
a_s = a.split(" ")
b_s = b.split(" ")
seta = set(a_s)
setb = set(b_s)
intersection = seta.intersection(setb)
print(intersection)
{'of', 'and', 'frivolous', 'lilac', 'the'}
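One caveat with the approach above: split(" ") only cuts on spaces, so two words separated by a line break (\n) stay glued together, and "The" and "the" count as different words. A possible refinement, sketched here on two short made-up excerpts, is to lower-case the text and call split() without an argument, which cuts on any whitespace.

```python
stanza_a = "The frivolous cake sailed by\nOn the waves"   # short excerpts for illustration
stanza_b = "the knife\nWho winketh"

print(stanza_a.split(" "))   # 'by\nOn' stays in one piece

words_a = set(stanza_a.lower().split())   # split() with no argument also cuts on line breaks
words_b = set(stanza_b.lower().split())
common = words_a.intersection(words_b)
print(common)                # {'the'}
print(len(common))           # 1
```

Punctuation attached to words (such as "sea,") would still need separate handling.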
Syntax¶
Finally, some notions of syntax. You write code as you would write anything: sequentially. This means you must define your variables and functions before using them, or Python won't know what you mean. That being said, there are two basic syntactic ideas that are crucial to any script - or indeed, to any software you are currently using. These are loops and conditions.
Loops¶
A loop tells Python to go over (the term is "iterate") a number of elements, most often from a list.
The syntax is always the same: for x in list: y, where "x" is the temporary name given to each element of the "list" in turn, and "y" is what happens to that "x". In other words, start with the first element (called "x" in the context of the loop), do stuff ("y") with that element, then go over the next element (which will also be called "x"), and so on.
sentences = ["1.1 A freckled and frivolous cake there was", "1.1 That sailed upon a pointless sea, ", "1.2 Or any lugubrious lake there was", "1.3 In a manner emphatic and free."]
print(sentences[0].upper())
print(sentences[1].upper())
1.1 A FRECKLED AND FRIVOLOUS CAKE THERE WAS 1.1 THAT SAILED UPON A POINTLESS SEA,
for x in sentences:
    newphrase = x.upper()
    print(newphrase)
1.1 A FRECKLED AND FRIVOLOUS CAKE THERE WAS 1.1 THAT SAILED UPON A POINTLESS SEA, 1.2 OR ANY LUGUBRIOUS LAKE THERE WAS 1.3 IN A MANNER EMPHATIC AND FREE.
words = ["A", "Freckled", "and", "Frivolous", "Cake", "There", "Was"]
for x in words: # This loop will print each word from the list one by one
    var = x.upper()
    print(var)
A FRECKLED AND FRIVOLOUS CAKE THERE WAS
After the loop has been completed, the variable x is still available: it represents whatever was the last item iterated over.
print(x)
You will note that for your loop to work, the second level of instructions needs to be shifted to the right (and you have a colon at the end of your for statement). That's called indentation, and it is crucial in Python. It's also one of the main reasons why people don't like this language. Other languages are more explicit as to when a section of your code is actually contained in another section: for example, in C++ you would put stuff within curly braces, and indicate the end of a statement with a semi-colon.
You can loop over lists, strings, and other objects we will discover later.
recreated_text = "" # We start by creating an empty text variable
ii = 0
for letter in ["Swordfish", "HEC"]: # We loop over a list of two strings (looping over a single string would go through its letters one by one)
    print(letter) # We first print the items, one by one
    recreated_text += letter # Then we add the item to the existing recreated text; remember that x += 1 increments x by 1
    print(recreated_text)
    ii += 1
print(ii)
Swordfish Swordfish HEC SwordfishHEC 2
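Looping over a single string, rather than a list of strings, goes through it one character at a time. A minimal sketch (the variable names rebuilt and count are illustrative):

```python
rebuilt = ""    # empty string that will be filled letter by letter
count = 0       # counts how many times the loop runs

for letter in "Swordfish":   # each turn of the loop, `letter` is one character
    rebuilt += letter
    count += 1

print(rebuilt)  # Swordfish
print(count)    # 9
```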
As everywhere else in Python, the order of things is very important, including in the context of a loop.
for x in words: # We loop over the words
    y = x # We assign a new variable y that's the same as every x, one by one
    print(x + " - " + y)
for x in words: # In this second loop, y is printed before being re-assigned, so it still holds the previous x
    print(x + " - " + y)
    y = x
A - A Freckled - Freckled and - and Frivolous - Frivolous Cake - Cake There - There Was - Was A - Was Freckled - A and - Freckled Frivolous - and Cake - Frivolous There - Cake Was - There
Exercise¶
Without using any dedicated function, reverse the order of the words list, in a new list called sdrow. Print the new list.
# Hint: note that concatenating two lists results in a single list
a = ["x"]
b = ["y"]
c=b+a
print(c)
['y', 'x']
words = ["A", "Freckled", "and", "Frivolous", "Cake", "There", "Was"]
l = []
# Your code here
for x in words:
    l = [x] + l
    print(l)
print(l)
['A'] ['Freckled', 'A'] ['and', 'Freckled', 'A'] ['Frivolous', 'and', 'Freckled', 'A'] ['Cake', 'Frivolous', 'and', 'Freckled', 'A'] ['There', 'Cake', 'Frivolous', 'and', 'Freckled', 'A'] ['Was', 'There', 'Cake', 'Frivolous', 'and', 'Freckled', 'A'] ['Was', 'There', 'Cake', 'Frivolous', 'and', 'Freckled', 'A']
sdrow = []
for x in range(len(words) -1, -1, -1):
    sdrow.append(words[x])
print(sdrow)
['Was', 'There', 'Cake', 'Frivolous', 'and', 'Freckled', 'A']
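The second solution relies on range, which has not been introduced so far. As a brief sketch: range(start, stop, step) produces a sequence of whole numbers, and, as with slices, the stop value is excluded. Wrapping it in list() makes the numbers visible; the variable names below are illustrative.

```python
words = ["A", "Freckled", "and", "Frivolous", "Cake", "There", "Was"]

counting_up = list(range(3))                           # start defaults to 0, step to 1
window = list(range(2, 6))                             # from 2 up to, but not including, 6
counting_down = list(range(len(words) - 1, -1, -1))    # from the last index down to 0

print(counting_up)     # [0, 1, 2]
print(window)          # [2, 3, 4, 5]
print(counting_down)   # [6, 5, 4, 3, 2, 1, 0]
```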
Conditions¶
The second important syntax element, and really the basic building block of so much code that runs your daily life, is the if/else statement. It simply asks if a condition is met (with booleans!), and then executes the corresponding code.
The syntax is of the form if x:, where "x" needs to be True (in the boolean sense) for the (indented) code coming after the colon to run. 2 + 2 equals 4, so a statement if 2+2 == 4: print("Correct") would print Correct. (Note that we use "==" to check for equality, since the single "=" sign is used to assign variables.)
In the example below, we will check whether the letter "e" (i.e., a string corresponding to the lower case "e") is present in each word of a list of words. This simply requires checking whether that letter is in the target variable.
Note that the else branch will run only if the if condition has not been met.
var = 3
if var == 4:
    print("XX")
elif var != 3:
    print("NN")
elif var == 2:
    print("OO")
else:
    print("YY")
YY
if var == 4 and var !=3 or var == 3:
    print("OO")
else:
    print("YY")
OO
words
['A', 'Freckled', 'and', 'Frivolous', 'Cake', 'There', 'Was']
"a" in HEC
True
var = True
for x in words:
    if "e" in x:
        print(x)
    else:
        print(x, " : No 'e' in that word")
A : No 'e' in that word Freckled and : No 'e' in that word Frivolous : No 'e' in that word Cake There Was : No 'e' in that word
There are several ways to write if statements:
- With the keyword in if you need to check that an item is part of a list or a set (or the inverse, not in);
- By itself if you are checking a boolean (if my_bol: will run the indented code or not, depending on whether my_bol is True or False);
- With the double equal sign == for equality between two variables, or != for inequality; and
- With a combination of the signs >, < and = (as in >= or <=) when comparing two quantities.
Finally, you can combine conditions with the keywords and and or.
In case you want to try a second condition after a first one is not met, you can use the keyword elif ("else if"), which works exactly like if.
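When and and or appear in the same condition, Python evaluates and first. The short sketch below (the variable names are illustrative) shows how parentheses change the grouping, and therefore the result:

```python
var = 3

# "and" binds more tightly than "or", so this reads as
# (var == 4 and var != 3) or (var == 3)
without_parentheses = var == 4 and var != 3 or var == 3

# Parentheses change the grouping: var == 4 and (var != 3 or var == 3)
with_parentheses = var == 4 and (var != 3 or var == 3)

print(without_parentheses)  # True
print(with_parentheses)     # False
```

When in doubt, adding parentheses makes the intended grouping explicit.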
sentence = " ".join(words) # We recreate the sentence from the list of words with the method join
print(sentence)
A Freckled and Frivolous Cake There Was
sentence = " ".join(words) # We recreate the sentence from the list of words with the method join
print(sentence)
my_bol = False # We set a boolean that's False
if "frivolous" in sentence:
    print("First Condition Met")
elif my_bol:
    print("Second Condition Met")
elif len(sentence) == 150: # len() is a built-in function returning the length of a list or string
    print("Third Condition Met")
elif len(sentence) >= 30 and "e" in sentence or "cake" not in sentence:
    print("Fourth Condition Met")
else:
    pass
A Freckled and Frivolous Cake There Was Fourth Condition Met
That is all, or nearly: most of the Python scripts you can see out there are built on these very basic concepts.
Exercise¶
Find the (i) longest sentence in the poem that (ii) contains the sound "ake", (iii) does not contain the word "knife", and (iv) has fewer than 8 words (not counting line numbers).
poem = 'The Frivolous Cake\n1.1 A freckled and frivolous cake there was\n1.1 That sailed upon a pointless sea, \n1.2 Or any lugubrious lake there was\n1.3 In a manner emphatic and free.\n1.4 How jointlessly, and how jointlessly\n1.5 The frivolous cake sailed by\n1.6 On the waves of the ocean that pointlessly\n1.7 Threw fish to the lilac sky.\n\n2.1 Oh, plenty and plenty of hake there was\n2.1 Of a glory beyond compare, \n2.2 And every conceivable make there was\n2.3 Was tossed through the lilac air.\n\n3.1 Up the smooth billows and over the crests\n3.1 Of the cumbersome combers flew\n3.2 The frivolous cake with a knife in the wake\n3.3 Of herself and her curranty crew.\n3.4 Like a swordfish grim it would bounce and skim\n3.5 (This dinner knife fierce and blue) , \n3.6 And the frivolous cake was filled to the brim\n3.7 With the fun of her curranty crew.\n\n4.1 Oh, plenty and plenty of hake there was\n4.1 Of a glory beyond compare -\n4.2 And every conceivable make there was\n4.3 Was tossed through the lilac air.\n\n5.1 Around the shores of the Elegant Isles\n5.1 Where the cat-fish bask and purr\n5.2 And lick their paws with adhesive smiles\n5.3 And wriggle their fins of fur, \n5.4 They fly and fly \x91neath the lilac sky -\n5.5 The frivolous cake, and the knife\n5.6 Who winketh his glamorous indigo eye\n5.7 In the wake of his future wife.\n\n6.1 The crumbs blow free down the pointless sea\n6.1 To the beat of a cakey heart\n6.2 And the sensitive steel of the knife can feel\n6.3 That love is a race apart\n6.4 In the speed of the lingering light are blown\n6.5 The crumbs to the hake above, \n6.6 And the tropical air vibrates to the drone\n6.7 Of a cake in the throes of love.'
# Step 1: Get a list of lines
sents = poem.split("\n")
longest_line = ""
# Step 2: Loop over the list of lines
for line in sents:
    words = line.split(" ")
    # Step 3: Check the conditions
    if 'ake' in line and "knife" not in line and len(words[2:]) < 8:
        #print(line)
        # Step 4: Compare the length of the valid sentence to the previous longest valid sentence
        if len(line) > len(longest_line):
            print(line)
            longest_line = line
print(longest_line)
The Frivolous Cake 1.1 A freckled and frivolous cake there was 1.1 A freckled and frivolous cake there was
line[5:]
'Of a cake in the throes of love.'
wordsb = line[5:].split(" ")
print(wordsb)
['Of', 'a', 'cake', 'in', 'the', 'throes', 'of', 'love.']
words[2:]
['Of', 'a', 'cake', 'in', 'the', 'throes', 'of', 'love.']
Regexes¶
Earlier, we devised a basic algorithm to count the number of words in a text. However, there is a much better and simpler way to do this: it's time to introduce regular expressions, or "regex" for short. (more info here)
We'll spend some time on them because they are extremely important for text-heavy applications; in a course about finance or statistics we would not need them too much, but since we'll be analysing judgments and legal texts, regexes are essential. And they are great. At the end of this task, you'll be annoyed every time search engines (like Google) don't do regex. It's just so much better.
Regexes are patterns that allow you to identify text. These patterns rely on special symbols to cover a range of characters in natural, written language. Because they rely on patterns, they are much more powerful than a search that focuses on a specific word: the word itself might be conjugated, or written in lower case; a sentence could have extra words. You might be interested in a range of numbers and not a specific one, etc.
For instance, the symbol "\d" means "any digit", and if you try to match this pattern against a sentence that includes a number, there will be a positive result.
import regex as re # You need to import the regex module
target_sentence = "Count: 37 frivolous cakes and 40 knifes !"
pattern = r"4\d" # The r prefix makes this a raw string, so Python leaves the backslash untouched
result = re.search(pattern, target_sentence)
print(result)
<regex.Match object; span=(30, 32), match='40'>
# What's that regex ?
"\d\d-\d\d-\d\d\d\d"
re.search() returns a match object (here, the variable result), which comes with a number of attributes and methods. For instance, that object stores the start of the matching pattern in the target sentence, as well as its end, and the exact matched text (method ".group()").
print("Pattern was found at index ", result.start(), " of target string !")
print("String continued after pattern at index ", result.end())
print("Regex search found ", result.group(), " that matched this pattern")
You'll note that there were several numbers in the target sentence, but the "search" function only found one - the first one. To get all matches, you need another function, findall, which returns a list of results.
re.findall("\d+", target_sentence)
['37', '40']
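Going back to the word-counting problem mentioned at the start of this section, findall offers a short way to do it. This is a minimal sketch, assuming that a "word" is simply a run of word characters (\w+); it uses Python's built-in re module, which provides the same findall call as regex.

```python
import re  # standard-library module; regex offers the same findall call

sentence = "A Freckled and Frivolous Cake There Was"

# \w+ matches one or more consecutive word characters, i.e. one word
found_words = re.findall(r"\w+", sentence)
word_count = len(found_words)

print(found_words)
print(word_count)
```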
In addition, you have re.sub(pattern, newpattern, target_sentence), which substitutes the new pattern for every match of the pattern.
There is also re.split(pattern, target_sentence), which returns a list of strings from the original text, as split by the pattern. Notice that the result does not include the splitting pattern.
print(re.sub("Cake", "Hake (?!)", poem[:19]))
print(re.split(" |v", poem[:20]))
The Frivolous Hake (?!) ['The', 'Fri', 'olous', 'Cake\n1']
All very good. Now, here are the basic patterns:
- Any particular word or exact spelling will match itself: cake will match cake (but not Cake, unless you tell regex to be case-insensitive - see below);
- The period . matches any character (except, by default, a line break), so c.ke would get "cake" or "coke", or even "cOke"; if you need to look specifically for a period, you need to escape it with a backslash: \.
- \s matches white space, including line breaks, etc.; note that the upper-case version, \S, matches anything but white space; and
- \w matches a word character (a letter, a digit, or an underscore), while \W matches anything else.
print(re.findall(".ake", poem)) # Plenty of "ake" sounds in that poem
print(re.findall("\d\.\d", poem)) # To look for a period, you need to escape it with a backslash
print(re.search("\W", poem)) # It will find the first space in the poem
['Cake', 'cake', 'lake', 'cake', 'hake', 'make', 'cake', 'wake', 'cake', 'hake', 'make', 'cake', 'wake', 'cake', 'hake', 'cake'] ['1.1', '1.1', '1.2', '1.3', '1.4', '1.5', '1.6', '1.7', '2.1', '2.1', '2.2', '2.3', '3.1', '3.1', '3.2', '3.3', '3.4', '3.5', '3.6', '3.7', '4.1', '4.1', '4.2', '4.3', '5.1', '5.1', '5.2', '5.3', '5.4', '5.5', '5.6', '5.7', '6.1', '6.1', '6.2', '6.3', '6.4', '6.5', '6.6', '6.7'] <regex.Match object; span=(3, 4), match=' '>
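To see these basic patterns side by side, here is a small sketch on a made-up sentence (the sample string and variable names are illustrative and not part of the poem). It uses the built-in re module, which understands the same symbols.

```python
import re

sample = "Cake no. 1.5: tossed\tthrough the air"

# Escaped period: only a literal "." between two digits
digits_with_dot = re.findall(r"\d\.\d", sample)

# Bare period: any character between two digits
any_char = re.findall(r"\d.\d", "1.5 1x5 1 5")

# \s catches the spaces as well as the tab
whitespace = re.findall(r"\s", sample)

# \W finds the first character that is not a letter, digit or underscore
first_non_word = re.search(r"\W", sample)

print(digits_with_dot)
print(any_char)
print(len(whitespace))
print(first_non_word.start())
```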
In addition, the following rules apply:
- Square brackets indicate a set of characters: [0-8a-q,] will match a single digit between 0 and 8, OR a letter between a and q, OR a comma (if you need a hyphen in your set, put it at the end);
- The symbol | (that's Alt + 6 on your keyboard) means "or";
- You indicate the expected number of repetitions with braces: [A-Q]{3} means you are looking for three (consecutive) upper-case letters between A and Q, while [A-Q]{3,6} means you expect between 3 and 6, and [A-Q]{3,} means "at least 3" (but potentially more), on the same logic as indexing (except with commas instead of colons);
- Two special characters do the same job, but open-ended: + means that you expect at least one hit, while * means you expect any number of hits (including none; add a ? for non-greediness). A concrete example of a fixed count would be \d{4}: the four-digit year of a date;
- Any pattern becomes optional if you add a ? behind it: cakey? will find cakey or cake;
- You can group patterns by bracketing them with parentheses, and then build around them: for instance, ([0-8a-q])|([9r-z]). You can even name the groups to retrieve them precisely from the match object when there is a match; and
- Characters that are usually used for patterns (such as ? or |) can be searched for themselves by "escaping" them with a backslash \ (and the backslash can be escaped with another backslash: \\ will look for \). Note that regex provides you with an escape function that returns a pattern, but escaped.
print(re.findall("[chlwm]ake", poem))
print(re.search("cake|knife", poem))
print(re.search("\d\d-\d{2}-\d+", "This is a date: 11-02-1992")) # Note that \d\d and \d{2} are strictly equivalent
print(re.search("cakey?", "cake or cakey?")) # Here as well, if you ever need to look for an "?", you need to escape it: "\?"
['cake', 'lake', 'cake', 'hake', 'make', 'cake', 'wake', 'cake', 'hake', 'make', 'cake', 'wake', 'cake', 'hake', 'cake'] <regex.Match object; span=(49, 53), match='cake'> None <regex.Match object; span=(0, 4), match='cake'>
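Two of the rules above, named groups and the escape function, can be sketched as follows. The example date and the group names day, month and year are made up for illustration; the built-in re module is used here, and it offers both features.

```python
import re

# Named groups: (?P<name>...) lets you retrieve each part of the match by name
date_pattern = r"(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})"
match = re.search(date_pattern, "Delivered on 24-06-2021")
print(match.group("day"), match.group("month"), match.group("year"))

# re.escape returns the text with its special characters escaped,
# so that it can be searched for literally
literal = re.escape("Hake (?!)")
found = re.search(literal, "The Frivolous Hake (?!)")
print(found.group())
```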
Finally, there are so-called flags, which are typically passed outside of the pattern as a third argument (but can also be used inside it for a single sub-pattern) to give further instructions, such as:
- Ignore case, re.I;
- Dot matches all (the period . also matches line breaks), re.S;
- Verbose (allows you to add white spaces that don't count as pattern), re.X; and
- Multiline ($ and ^ will work for any single line, and not simply for the start and end of the full text), re.M.
print(re.search("cake", "The Frivolous Cake", re.I)) # This works despite the capital C since we specified re.I
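The other three flags can be sketched in the same way. The two-line text below is a shortened stand-in for the poem, and the variable names are illustrative; the built-in re module is used, with the same flag names.

```python
import re

text = "The Frivolous Cake\nA freckled and frivolous cake there was"

# Without re.M, ^ only matches at the very start of the text
first_words_default = re.findall(r"^\w+", text)
# With re.M, ^ also matches at the start of every line
first_words_multiline = re.findall(r"^\w+", text, re.M)

# Without re.S the period stops at line breaks; with re.S it crosses them
across_default = re.search(r"Cake.A", text)
across_dotall = re.search(r"Cake.A", text, re.S)

# re.X lets you spread a pattern over several lines and comment it
verbose = re.compile(r"""
    \d{2}   # day
    -
    \d{2}   # month
    """, re.X)

print(first_words_default)     # ['The']
print(first_words_multiline)   # ['The', 'A']
print(across_default)          # None
print(across_dotall is not None)
print(verbose.search("Date: 11-02").group())
```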
Regex becomes really powerful once you add conditions to your regex pattern.
- A pattern preceded by a ^ will be looked for only at the beginning of the text (or of a line, with re.M); a pattern followed by a $ will only match if it finishes the line or text;
- Adding a (?=2ndpattern) after your first pattern indicates that your first pattern will match only if the text right after it matches your second pattern, but the second pattern won't be captured in the match object (this is very useful, e.g., for substitution); and
- In the same vein, (?!2ndpattern), (?<=2ndpattern), and (?<!2ndpattern) are conditions for "if it does not match after", "if it matches before", and "if it doesn't match before", respectively. This can be hungry in terms of computing power, so don't overdo it.
print(re.search("^A Freckled|throes of love.$", poem)) # Only the second alternative will be found, since the first words are not at
# the beginning of a line (the numbers 1.1 are)
print(re.search("plenty of (?=cake)", poem)) #This returns None since there are no "plenty of cake" in the poem
print(re.search("plenty of (?=.ake)", poem)) #But this returns a match, since there is "hake"
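The remaining three conditions can be sketched on a single line of the poem (copied here as a plain string, with illustrative variable names). The built-in re module is used; it supports these conditions as long as the "before" patterns have a fixed length, which is the case here.

```python
import re

line = "The frivolous cake with a knife in the wake"

# (?<=...) "if it matches before": "cake" only when preceded by "frivolous "
after_frivolous = re.search(r"(?<=frivolous )cake", line)

# (?<!...) "if it doesn't match before": "ake" not preceded by a "c"
ake_without_c = re.findall(r"(?<!c)ake", line)

# (?!...) "if it does not match after": an "ake" word not followed by " with"
ake_not_before_with = re.findall(r"\wake(?! with)", line)

print(after_frivolous)
print(ake_without_c)
print(ake_not_before_with)
```

In each case, only the part outside the condition ends up in the result: the word "frivolous" is checked but not returned.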
Recent versions of the regex module also provide fuzzy searches - that is, searches with a bit of leeway to catch things despite errors in the pattern (this can be very greedy in resources, though, so be careful when you use it). For instance, in re.search("(coke){e<=1}", poem), the braced statement means "at most one error (e)", so that "cake" counts as a match, as there is only one difference (the letter o/a) between the pattern and the word.
Finally, match objects count as booleans: if result: will run the indented code if there was a match, while you can check for a null result by asking "if result is None". (None is a special Python object that means there is no data.)
Note that there are tools to help you check if your regexes work well on the given dataset, such as this one online.
for line in poem.split("\n"): # We split the poem by lines and we loop over these lines
    if re.search("cake|knife", line, re.I): # We check whether "cake" or "knife" is in the line
        print(line) # If it is, we print the line
    else:
        print("No Cake or knife in that line...")
Exercise¶
Print every line that includes a word that starts with "b" or "h" and has no more than 4 letters.
# Your code here
import regex as re
for x in poem.split("\n"):
    result = re.search(" ([bh].){1,4}", x, re.I)
    if result:
        print(x)
1.4 How jointlessly, and how jointlessly 1.5 The frivolous cake sailed by 2.1 Oh, plenty and plenty of hake there was 2.1 Of a glory beyond compare, 3.1 Up the smooth billows and over the crests 3.3 Of herself and her curranty crew. 3.4 Like a swordfish grim it would bounce and skim 3.5 (This dinner knife fierce and blue) , 3.6 And the frivolous cake was filled to the brim 3.7 With the fun of her curranty crew. 4.1 Oh, plenty and plenty of hake there was 4.1 Of a glory beyond compare - 5.1 Where the cat-fish bask and purr 5.6 Who winketh his glamorous indigo eye 5.7 In the wake of his future wife. 6.1 The crumbs blow free down the pointless sea 6.1 To the beat of a cakey heart 6.4 In the speed of the lingering light are blown 6.5 The crumbs to the hake above,
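Note that the output above also includes lines whose only "b" or "h" word is longer than four letters (such as "beyond" or "billows"), because the pattern does not check where the word ends. A stricter variant, offered here only as one possible sketch, relies on the word boundary symbol \b, which has not been introduced above. The sample_lines list is a small stand-in for poem.split("\n"), and the built-in re module is used.

```python
import re

# A few lines from the poem, used as a small stand-in for poem.split("\n")
sample_lines = [
    "1.5 The frivolous cake sailed by",
    "2.1 Of a glory beyond compare,",
    "3.1 Up the smooth billows and over the crests",
    "3.3 Of herself and her curranty crew.",
]

# \b marks a word boundary; [bh] is the first letter; \w{0,3} allows
# up to three more word characters, so the whole word has at most four
short_word = re.compile(r"\b[bh]\w{0,3}\b", re.I)

matching_lines = []
for line in sample_lines:
    if short_word.search(line):
        matching_lines.append(line)
        print(line)
```

Here only the lines containing "by" and "her" are printed; "beyond" and "billows" are too long to match.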
Wordle¶
This is a popular online game! Let's try to reproduce it.
We first need to make the required imports:
- a module to simulate randomness (so as to have a new word every time we play);
- regex to check if texts match what we want; and
- a corpus of words to pick from.
import random
import re
import nltk
nltk.download('brown')
from nltk.corpus import brown # nltk may need to be first installed with pip install nltk
[nltk_data] Downloading package brown to /root/nltk_data... [nltk_data] Package brown is already up-to-date!
Then we need to create a list of words to choose from, since the brown corpus has about a million words - and we only need words with five letters. We also want to avoid proper names.
words = []
for x in brown.words():
    if len(x) == 5 and re.search(r"^[A-Z]|[\.,]", x) is None: # We skip capitalised words (likely proper names) and tokens with punctuation
        words.append(x.upper()) # We harmonise all words with caps
word = random.choice(words) # We pick a random word
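To check what this kind of filter keeps and drops without downloading the corpus, here is a minimal sketch on a short, made-up list of tokens (a hypothetical stand-in for brown.words()). It assumes that a proper name is any token starting with a capital letter, tested with the pattern ^[A-Z].

```python
import re

# Hypothetical stand-in for brown.words(), so the sketch runs without nltk
tokens = ["cakes", "Paris", "knife", "U.S.A", "there", "sea", "1,000"]

five_letter_words = []
for token in tokens:
    starts_with_capital = re.search(r"^[A-Z]", token) is not None
    has_punctuation = re.search(r"[.,]", token) is not None
    if len(token) == 5 and not starts_with_capital and not has_punctuation:
        five_letter_words.append(token.upper())

print(five_letter_words)
```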
Then we create a function that will embody the algorithm needed to play the game. That function will take as an input/argument the word guessed by the player.
The first few steps are to check whether that word can even qualify for the game: whether it is five letters long, and is part of the existing corpus.
Then, if this is the case, we iterate over the letters of the guessed word, one by one, and we check three cases:
- if the letter is in the target word and at the right place (so, same index), we color it green;
- if the letter is in the target word, but not at the right place, we color it yellow; and
- if the letter is not in the target word, we color it grey.
def play(answer): # We create a function that prints each letter of the guess in a format that depends on how close we are to the right answer
    answer = answer.upper() # Get the all caps version of the word to compare with dataset of words
    if len(answer) > 5: # We first check that the input word in answer fits the requirement: be 5 in len, and in the dataset
        print("Too long")
    elif len(answer) < 5:
        print("Too Short")
    elif answer not in words:
        print("Word does not exist")
    else: # If this is a proper guess, we proceed to the main part of the function
        for e, letter in enumerate(answer): # The function enumerate allows you to iterate over a list together with the index
            if letter in word and answer[e] == word[e]: # If the letter is in the word and at the exact same place, we print it on a green background
                print('\x1b[1;30;42m' + letter + '\x1b[0m', end=" ")
            elif letter in word: # If it is in the word, but at a different place, we print it on a yellow background
                print('\x1b[1;30;43m' + letter + '\x1b[0m', end=" ")
            else: # Otherwise we just print the letter
                print(letter, end=" ")
play("Tolls")
T O L L S
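Because play picks its target at random and prints colors, its result is hard to check by eye. The sketch below is a hypothetical variant (the function score_guess and its target argument are not part of the notebook) that applies the same three cases but returns color names, so that the logic can be tried on a known target. It assumes both words have the same length and, like play, does not treat repeated letters specially.

```python
def score_guess(guess, target):
    """Return one color label per letter of guess, following the three cases above."""
    guess, target = guess.upper(), target.upper()
    colors = []
    for index, letter in enumerate(guess):
        if target[index] == letter:      # right letter, right place
            colors.append("green")
        elif letter in target:           # right letter, wrong place
            colors.append("yellow")
        else:                            # letter not in the target word
            colors.append("grey")
    return colors

print(score_guess("tolls", "stole"))
```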
List Comprehension¶
Python code is a good middle ground between very verbose code (VBA for instance), and languages that are perfectly opaque to the neophyte. When you look at the syntax, given a few basics, you can have a rough idea of what's happening.
The issue with verbosity, however, is that it takes space and time. If you need to populate a list from another list given a condition, you have now learned that you can use a loop and a conditional statement to perform the operation. But again, it can be cumbersome to write down all of this.
Enter list comprehensions, which are a way to create a list in a single line. The syntax is of the kind:
[x for x in list]
So, what you are trying to do is to invoke every element in the list, and operate over it to create a new list (hence the brackets around the statement).
Take for instance these lines, which add numbers to a list after tripling them.
my_list = []
for x in range(1,25):
    my_list.append(x * 3)
print(my_list)
This can be rewritten as a list comprehension in line with the syntax above
new_list = [x * 3 for x in range(1,25)]
print(new_list)
Note that the power of this method comes from the fact that you can go much further than the bare statement I gave you here. In particular, you can add conditions. For instance, let's say we are looking for every even number in a list of numbers.
even_list = []
for x in my_list:
    if x % 2 == 0: # The modulo operator, using the percent symbol, returns the remainder of a division. An even number's
                   # remainder when divided by 2 is always 0
        even_list.append(x)
print(even_list)
And this is the same list created with a list comprehension.
new_even_list = [x for x in new_list if x % 2 == 0]
print(new_even_list)
Note that you can add several conditions: the usual and, or, and not keywords, booleans, and comparisons with None all work in this context as well.
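As a short sketch of combined conditions inside a comprehension (the list and variable names are illustrative, and the list is rebuilt here so that the example runs on its own):

```python
numbers = [x * 3 for x in range(1, 25)]   # 3, 6, 9, ... up to 72

# Keep numbers that are even and above 30, or that are exactly 3
selected = [x for x in numbers if (x % 2 == 0 and x > 30) or x == 3]

# "not" works too: keep the numbers that are not multiples of 9
not_multiples_of_nine = [x for x in numbers if not x % 9 == 0]

print(selected)
print(not_multiples_of_nine)
```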
Finally, the first item in the comprehension can also be operated upon. Let's say we now want the even numbers from my_list, except times three and in a string that starts with "Number: ".
even_more_new_list = ["Number: " + str(x * 3) for x in new_list if x % 2 == 0]
print(even_more_new_list)