Preparation¶
pip install gdown==v4.6.3
Collecting gdown==v4.6.3 Downloading gdown-4.6.3-py3-none-any.whl (14 kB) Installing collected packages: gdown Found existing installation: gdown 5.2.0 Uninstalling gdown-5.2.0: Successfully uninstalled gdown-5.2.0 Successfully installed gdown-4.6.3
import os
import requests
import zipfile
# Create the target directory if it doesn't exist
target_directory = "data/CE"
os.makedirs(target_directory, exist_ok=True)
# Direct URLs to the zip files
zips = ['https://opendata.justice-administrative.fr/DCE/2021/06/CE_202106.zip',
'https://opendata.justice-administrative.fr/DCE/2021/07/CE_202107.zip',
'https://opendata.justice-administrative.fr/DCE/2021/09/CE_202109.zip',
'https://opendata.justice-administrative.fr/DCE/2021/10/CE_202110.zip',
'https://opendata.justice-administrative.fr/DCE/2021/11/CE_202111.zip',
'https://opendata.justice-administrative.fr/DCE/2021/12/CE_202112.zip',
'https://opendata.justice-administrative.fr/DCE/2022/01/CE_202201.zip',
'https://opendata.justice-administrative.fr/DCE/2022/02/CE_202202.zip',
'https://opendata.justice-administrative.fr/DCE/2022/03/CE_202203.zip',
'https://opendata.justice-administrative.fr/DCE/2022/04/CE_202204.zip',
'https://opendata.justice-administrative.fr/DCE/2022/05/CE_202205.zip',
'https://opendata.justice-administrative.fr/DCE/2022/06/CE_202206.zip',
'https://opendata.justice-administrative.fr/DCE/2022/07/CE_202207.zip',
'https://opendata.justice-administrative.fr/DCE/2022/08/CE_202208.zip',
'https://opendata.justice-administrative.fr/DCE/2022/09/CE_202209.zip',
'https://opendata.justice-administrative.fr/DCE/2022/10/CE_202210.zip',
'https://opendata.justice-administrative.fr/DCE/2022/11/CE_202211.zip',
'https://opendata.justice-administrative.fr/DCE/2022/12/CE_202212.zip',
'https://opendata.justice-administrative.fr/DCE/2023/01/CE_202301.zip',
'https://opendata.justice-administrative.fr/DCE/2023/02/CE_202302.zip',
'https://opendata.justice-administrative.fr/DCE/2023/03/CE_202303.zip',
'https://opendata.justice-administrative.fr/DCE/2023/04/CE_202304.zip',
'https://opendata.justice-administrative.fr/DCE/2023/05/CE_202305.zip',
'https://opendata.justice-administrative.fr/DCE/2023/06/CE_202306.zip',
'https://opendata.justice-administrative.fr/DCE/2023/07/CE_202307.zip',
'https://opendata.justice-administrative.fr/DCE/2023/08/CE_202308.zip',
'https://opendata.justice-administrative.fr/DCE/2023/09/CE_202309.zip',
'https://opendata.justice-administrative.fr/DCE/2023/10/CE_202310.zip',
'https://opendata.justice-administrative.fr/DCE/2023/11/CE_202311.zip',
'https://opendata.justice-administrative.fr/DCE/2023/12/CE_202312.zip']
# Download the zip files
for zip_url in zips:
    print(zip_url)
    zip_response = requests.get(zip_url)
    zip_filename = "downloaded_file.zip"
    with open(zip_filename, 'wb') as f:
        f.write(zip_response.content)
    # Unzip the file into the target directory
    with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
        zip_ref.extractall(target_directory)
    # Remove the downloaded zip file if needed
    os.remove(zip_filename)
https://opendata.justice-administrative.fr/DCE/2021/06/CE_202106.zip https://opendata.justice-administrative.fr/DCE/2021/07/CE_202107.zip https://opendata.justice-administrative.fr/DCE/2021/09/CE_202109.zip https://opendata.justice-administrative.fr/DCE/2021/10/CE_202110.zip https://opendata.justice-administrative.fr/DCE/2021/11/CE_202111.zip https://opendata.justice-administrative.fr/DCE/2021/12/CE_202112.zip
--------------------------------------------------------------------------- ConnectTimeout Traceback (most recent call last) <ipython-input-3-4d508af3c8a4> in <cell line: 0>() 42 for zip_url in zips: 43 print(zip_url) ---> 44 zip_response = requests.get(zip_url) ... ConnectTimeout: HTTPSConnectionPool(host='opendata.justice-administrative.fr', port=443): Max retries exceeded with url: /DCE/2021/12/CE_202112.zip (Caused by ConnectTimeoutError: 'Connection to opendata.justice-administrative.fr timed out. (connect timeout=None)')
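As the traceback shows, the server can be slow or unreachable, and requests.get will otherwise hang or fail loudly. A more defensive version of the loop - a sketch, not the approach used in this notebook, assuming the same zips list as above - would set an explicit timeout and skip files that cannot be fetched:
for zip_url in zips:
    try:
        zip_response = requests.get(zip_url, timeout=60) # An explicit timeout (in seconds) avoids hanging indefinitely
        zip_response.raise_for_status() # Raise an error on 4xx/5xx responses
    except requests.exceptions.RequestException as e:
        print("Skipping", zip_url, ":", e)
        continue
    with open("downloaded_file.zip", "wb") as f:
        f.write(zip_response.content)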
import gdown
# Google Drive URLs for the .txt file and an example .pdf
gdrive_url = "https://drive.google.com/uc?id=1-51EmCMxF6nlI5jb0XuPfWuwcMVXfHSd"
gdrive_url2 = "https://drive.google.com/uc?id=1HFPH4NuPJ0dAg3uuROjjFTP5PTTt_3X4"
# Target directory for the downloaded files ("/content" already exists on Colab)
target_directory_txt = "/content"
# Download both files into the target directory
gdown.download(gdrive_url, os.path.join(target_directory_txt, "poem.txt"), quiet=False)
gdown.download(gdrive_url2, os.path.join(target_directory_txt, "Example.pdf"), quiet=False)
!pip install pymupdf
Downloading... From: https://drive.google.com/uc?id=1-51EmCMxF6nlI5jb0XuPfWuwcMVXfHSd To: /content/poem.txt 100%|██████████| 1.73k/1.73k [00:00<00:00, 1.66MB/s] Downloading... From: https://drive.google.com/uc?id=1HFPH4NuPJ0dAg3uuROjjFTP5PTTt_3X4 To: /content/Example.pdf 100%|██████████| 209k/209k [00:00<00:00, 25.7MB/s]
Collecting pymupdf Downloading pymupdf-1.25.2-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB) Downloading pymupdf-1.25.2-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.0/20.0 MB 49.8 MB/s eta 0:00:00 Installing collected packages: pymupdf Successfully installed pymupdf-1.25.2
Manipulating Files¶
This course is about data analysis, not software creation. You'll mostly use Colab or the Console, and won't particularly need to write .py scripts, invent new classes, etc. Scripts can record methods and functions you invented, but most analyses are transient: you write them on the go, sometimes for a single task, and then move on to something else.
And yet, the data to be analysed needs to be stored. We'll now see the basics of this.
Navigating Files¶
But first a word about how computers work. You've got memory: your hard drive. You've got RAM: Random Access Memory, which is what your computer uses to perform its basic tasks. There is a constant back and forth between the two. When you create a new variable in the Console, it's stored in RAM; if you want to use it next time, you need to store it in the secondary memory, i.e., your hard drive.
You can do that through Python, but you need an interface between the code and your computer environment. Fortunately, Python relies on the same kind of methods that are at the basis of most computers (since they get their roots in the UNIX system). If you open the Command Prompt (cmd on Windows, or the Terminal on a Mac), you can navigate between folders with the command cd (for "change directory"), or create a new folder with mkdir.
Three commands in particular will be useful during this course:
- os.getcwd(), which means "Get Current Working Directory", outputs the current position of Python within your files
- os.chdir(x), which "changes directory" to the directory x you specify as argument (x can be a relative path, such as a subfolder of the current folder, or a full, absolute path)
- os.listdir("."), which returns a list of files in a directory (using the "." argument means: in the current directory)
(Note that the argument we gave to the listdir() method was ".", which usually means "this current folder", whereas ".." always means "parent folder".)
import os
current_path = os.getcwd()
print(current_path)
os.chdir("..")
new_path = os.getcwd()
print(new_path)
print(os.listdir(".")) # We check what files and subfolders are in the folder
os.chdir("content")
print(os.listdir("."))
/content / ['lib32', 'var', 'sys', 'etc', 'run', 'lib64', 'media', 'boot', 'root', 'dev', 'libx32', 'usr', 'srv', 'mnt', 'bin', 'lib', 'tmp', 'sbin', 'proc', 'home', 'opt', 'content', '.dockerenv', 'tools', 'datalab', 'python-apt', 'python-apt.tar.xz', 'NGC-DL-CONTAINER-LICENSE', 'cuda-keyring_1.0-1_all.deb'] ['.config', 'Example.pdf', 'data', 'poem.txt', 'sample_data']
Exercise¶
I downloaded a bunch of decisions from the Conseil d'Etat in a folder named CE, in the subfolder data, itself in the subfolder content. How many decisions are there?
# Your code here
os.chdir("/content/data/CE")
print(os.getcwd())
files = os.listdir(".")
print(len(files))
/content/data/CE 18358
Renaming Files¶
os is also very helpful for manipulating files from Python, for instance renaming them. Instead of spending hours renaming hundreds of files (a common thing for junior lawyers), you can do it with the os.rename(x, y) method, which renames file x to y.
files[0]
'DCE_448413_20210927.xml'
file = "DCE_448413_20210927.xml" # We select a file that is in the folder Data and attribute its name to a variable
newnamefile = file + "a" # We decide on a new name to give that file
os.rename(file, newnamefile)
print("The new file list is: ", [x for x in os.listdir(".") if "DCE" in x])
os.rename(newnamefile, newnamefile[:-1]) # We repair what we did
The new file list is: ['DCE_448985_20210927.xml', 'DCE_439145_20210928.xml', 'DCE_447452_20210927.xml', 'DCE_438042_20210928.xml', 'DCE_445700_20210927.xml', 'DCE_448691_20210927.xml', 'DCE_448569_20210927.xml', 'DCE_440983_20210927.xml', 'DCE_450316_20210927.xml', 'DCE_440190_20210927.xml', 'DCE_446572_20210927.xml', 'DCE_449032_20210927.xml', 'DCE_449779_20210927.xml', 'DCE_440987_20210928.xml', 'DCE_447242_20210927.xml', 'DCE_450148_20210927.xml', 'DCE_449151_20210927.xml', 'DCE_449713_20210927.xml', 'DCE_439696_20210928.xml', 'DCE_442455_20210927.xml', 'DCE_445848_20210927.xml', 'DCE_447625_20210928.xml', 'DCE_446727_20210927.xml', 'DCE_438009_20210927.xml', 'DCE_445885_20210927.xml', 'DCE_449511_20210927.xml', 'DCE_437650_20210928.xml', 'DCE_436740_20210927.xml', 'DCE_445388_20210927.xml', 'DCE_446804_20210927.xml', 'DCE_450582_20210927.xml', 'DCE_448389_20210927.xml', 'DCE_448413_20210927.xmla', 'DCE_448751_20210927.xml', 'DCE_443825_20210927.xml', 'DCE_449502_20210927.xml', 'DCE_449778_20210927.xml']
Note that the resulting file now appears corrupted: you changed the extension from .xml to .xmla, which is unknown - and so most software won't know how to open it anymore (the content itself is untouched). There are ways to control for this, but this is the subject of the exercise.
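One way to control for it - a minimal sketch, not the expected solution to the exercise - is to split off the extension with os.path.splitext, so you only ever modify the base name:
base, ext = os.path.splitext("DCE_448413_20210927.xml") # Returns ("DCE_448413_20210927", ".xml")
newnamefile = base + "a" + ext # "DCE_448413_20210927a.xml" - the extension is preserved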
"DCE_448413_20210927.xmla" in files
False
Exercises¶
Rename all DCE files with "Decision" (instead of DCE), and all ORCE files with "Ordonnance".
Count all decisions ("DCE") taken before September 16, 2023; do the same for all ordonnances ("ORCE").
Each decision has a case number, distinct from the date. Compute the total of all these case numbers. (Remember that you can change a string to a number with int().)
files
#1
os.chdir("/content/data/CE")
files = os.listdir(".")
for x in files:
    if "DCE" in x:
        newnamefile = "Decision" + x[3:] # We keep everything after the "DCE" prefix
        os.rename(x, newnamefile)
    if "ORCE" in x:
        os.rename(x, x.replace("ORCE", "Ordonnance"))
#2
ii = 0
for x in files:
    date = int(x.split("_")[-1][:-4]) # The date is the last underscore-separated part of the name, minus ".xml"
    if date < 20230916:
        ii += 1
print(ii) # This counts decisions and ordonnances together; to separate them, also check for "Decision" or "Ordonnance" in the name
3794
#3
total = 0
for x in files:
    number = int(x.split("_")[1]) # The case number is the second underscore-separated part of the name
    total += number
print(total)
1709707636
Loading and Storing Data¶
Anyhow, back to storing data. The canonical way to do it in Python is by opening a file and storing it in a variable. This variable (or object) comes with distinct methods, such as "read()" which returns the data inside the file. Another method is "write()", which allows you to add to the existing data.
To create a new file, you'd use the with syntax: with open("your_file_name.txt", "a") as f:; and then, in the indented part of the code, you use the "write()" method of the "f" object to add your text to the data.
Note the "a" argument, which means that you want to append to the file, creating it if it does not exist yet (you could input "r" only for reading, "w" for writing from scratch, or "r+" for both reading and writing).
***You may want to learn a bit about encoding, and the Unipain***
f = open("/content/poem.txt", encoding="latin1")
poem = f.read()
print(poem)
f.close()
with open("/content/poem2.txt", "a", encoding="utf8") as f:
    f.write(poem)
os.listdir("/content") # We check that we indeed saved a new "poem2" file in the /content folder
This is text data, arguably the most straightforward type of data to handle. As we'll see in the next task, however, a lot of what you'll be handling is structured data: either in marked-up format (XML, HTML), or in some kind of spreadsheet. You probably know Excel's native .xls format for spreadsheets, but there are plenty of others, and a good deal of data analysis relies on a simple format called .csv - which stands for comma-separated values.
Depending on what you do, you might not need to rely on these methods much: we'll see at some point how to handle data with pandas, which has its own, more straightforward methods to save and load data. Likewise, XML tools in Python typically have their own methods.
PDFs¶
Beyond text, .csv, and structured content, data is sometimes enclosed in .pdf files. Now, this is an issue: .pdfs are not meant for data analysis. Their (main) interest, and the reason why they were invented, is to be a format that preserves the appearance of a file as faithfully as possible, on all platforms. But this is not a data-friendly format.
Unfortunately, a lot of data out there is found in .pdfs, so you'll have to wrestle with them to extract their data.
For this, you'll need to use a third-party library dedicated to .pdf files, such as PyPDF2 or pdfminer. But the principle is the same: you open your .pdf and store it in an object, and then you use methods from this object (they differ depending on the package used) to obtain the data you are interested in.
import fitz # This is PyMuPDF; the module is imported as "fitz" for legacy reasons

pdf = fitz.open("/content/Example.pdf") # Open the .pdf file
num_pages = pdf.page_count
text = ""
for page in range(0, num_pages):
    page_obj = pdf.load_page(page)
    text += page_obj.get_text("text") # Extract text from the page
print(text[:500]) # Display the first 500 characters of the extracted text
XML and Structured Files¶
When you want to store and work with data, pure text is not very helpful; for a start, pure text usually does not include the formatting (bold, italic, etc.), and contains no info as to the role of a particular part of the text (for instance, in a judgment, the difference between the arguments of the parties, the reasoning, or the dispositif).
The solution here is to store your text into a file that follows a structure, according to a particular language. XML, for "Extensible Markup Language", is a structured language. HTML is another.
Likewise, a .docx, when you go into the details, is actually a text file with a layer of structure that gives Microsoft Word a range of information as to the formatting of that text. Here is an example of the difference between the two: this is the same part of a MSWord document, except the second is the internal .xml structure (after a bunch of manipulations on my part to make it somewhat readable).
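To give a rough idea - an illustrative sketch, simplified from the real WordprocessingML format rather than the exact output of Word - a bolded sentence in a .docx is stored along these lines:
<w:p>
  <w:r>
    <w:rPr><w:b/></w:rPr>
    <w:t>The actual text of the sentence</w:t>
  </w:r>
</w:p>
Here <w:p> is a paragraph, <w:r> a "run" of text sharing the same formatting, <w:b/> marks the run as bold, and <w:t> encloses the text itself.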
So back to .xml. In your Files, we have a number of decisions by the Conseil d'Etat released as part of their Open Data program. They are .xml files. They are not great, but they'll do.
Let's have a look at one of these files, as it appears if you open it with a browser. You can see that the main text is divided between what we call elements. Each element includes an opening tag (or "balise", in French), which must be accompanied by a closing tag of the same name. Tags and sections cannot overlap: when you open a tag in a context, you need to close it in that context. (You can also have self-standing, one-tag elements, of the form <Element/>.)
The documents from the Conseil d'Etat don't have many of those, but normally you can specify further attributes for each element: these are data points that will not be seen by a natural reader (unless you look at the code directly), but enclose further information (such as formatting, or a URL for a link) for the software, or data scientist, who is probing this data. A good example is the <a> element, which represents a link, and always has an attribute href, which is the URL:
<a href="My URL Here">My link here</a>
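In lxml, the XML library we'll use just below, an element's attributes are read with the .get() method - a minimal sketch on a made-up snippet:
from lxml import etree
el = etree.fromstring('<a href="https://example.com">My link here</a>') # Parse a single element from a string
print(el.tag) # a
print(el.get("href")) # https://example.com - the value of the href attribute
print(el.text) # My link here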
You can also see, hopefully, that the information is enclosed in a hierarchical format, like a tree: you start with the root, and then you get branches that can get branches of their own, etc. Here everything is enclosed in a Document element, itself preceded by the xml declaration. Yet Document has only four direct children, which themselves have further children.
"Children" is the usual term, though "descendants" is also sometimes used. Logically, you also have "parents" or "siblings".
The interest of storing data in a structured format is not only that you can include more than data (such as metadata), but also that, once you know the structure, you can extract data efficiently from all files that follow that format. The Conseil d'Etat decided a few years ago to release all their judgments according to that format, and code that worked to extract data from judgments back then also works for new judgments - as long as they follow the structure.
In other words, just like using a loop over the content of a list allows you to be agnostic about the data in that list, having a structure allows you to be agnostic about the data that was filled in that structure.
For instance, let's say we want to collect all dates from these decisions from the Conseil d'Etat. Instead of searching each text for a date, the .xml format is helpful: we can see that the date is enclosed in an element called Date_Lecture. We can just iterate over all files, and collect the dates.
The first thing to understand is that when you parse an .xml document, you need to start from the root. From there, you typically iterate over its descendants, sometimes by specifying a condition: for instance, we can look for all <p> elements, which represent the paragraphs. You also have various levels of iteration: over siblings, children, or ancestors. Another alternative is to go through all descendants and check if they are of the required type.
import pandas as pd
from lxml import etree # This is one of the main .xml reader modules in Python:
# the etree module from the lxml package. You need to: pip install lxml
import os
from datetime import datetime
from collections import defaultdict, Counter
os.chdir("/")
os.chdir("content/data/CE") # We go to the main folder that stores all files
files = os.listdir(".")
print(len(files)) # There are many files !
file = files[0] # Let's work on the first file to get an example
xml_file = etree.parse(file) # We first open the .xml file with the "parse" method
root = xml_file.getroot() # We then look for the "root" of the XML tree, and pass it to a variable root
print(root) # You can check the attributes of every element this way
print("Text of the element: " + root.text) # Likewise, the "text" attribute gives you the text inside an element;
# root has no text, as you can see everything is in the elements instead
3794 <Element Document at 0x7d334b96e740> Text of the element:
print(root)
root.text
<Element Document at 0x7d21208af340>
'\n'
Now, starting from the root, we can go through all its children and grandchildren. There are several ways to do this.
for child in root: # The parent element also works as a list of its children elements,
    # so you can easily iterate over it immediately like this
    print(child.tag, ":", child.text)
    for subchild in child:
        print(subchild.tag, ":", subchild.text)
Donnees_Techniques : Identification : DCE_461370_20220719.xml Date_Mise_Jour : 2022-07-20 Dossier : Code_Juridiction : CE Nom_Juridiction : Section du Contentieux Numero_Dossier : 461370 Date_Lecture : 2022-07-19 Numero_ECLI : ECLI:FR:CECHS:2022:461370.20220719 Type_Decision : Décision Type_Recours : Plein contentieux Code_Publication : D Solution : Rejet PAPC Audience : Date_Audience : 2022-06-16 Numero_Role : 22382 Formation_Jugement : 8ème chambre jugeant seule Decision : Texte_Integral : None
for el in root.iter("p"): # Though a better way to do it is with iter();
# this command takes arguments that allow you to filter the descendants
#print(el.tag) # This will return the text of the decision, paragraph by paragraph
print(el.text)
Vu la procédure suivante : M. A B a demandé au juge des référés du tribunal administratif de Paris, statuant sur le fondement de l'article L. 521-2 du code de justice administrative, de suspendre l'organisation en présentiel des examens du premier semestre du centre de préparation aux concours de la haute fonction publique de l'université Paris 1 Panthéon-Sorbonne prévus à compter du 3 janvier 2022 ou, à défaut, d'enjoindre à l'université d'organiser ces examens à distance ou, à titre encore subsidiaire, de réexaminer les modalités d'organisation de ces examens. Par une ordonnance n° 2128296 du 1er janvier 2022, le juge des référés du tribunal administratif de Paris a rejeté sa demande. I. Sous le n° 460051, par une requête enregistrée le 1er janvier 2022 au secrétariat du contentieux du Conseil d'Etat, M. B demande au juge des référés du Conseil d'Etat, statuant sur le fondement de l'article L. 521-2 du code de justice administrative : 1°) d'annuler cette ordonnance ; 2°) de suspendre l'organisation en présentiel des examens du premier semestre du centre de préparation aux concours de la haute fonction publique de l'université Paris 1 Panthéon-Sorbonne prévus à compter du 3 janvier 2022 ou, à défaut, d'enjoindre à l'université d'organiser ces examens à distance ou, à titre encore subsidiaire, de réexaminer les modalités d'organisation de ces examens ; 3°) de mettre à la charge de l'université Paris 1 Panthéon-Sorbonne une somme de 3 000 euros au titre de l'article L. 761-1 du code de justice administrative. Il soutient que : - la condition d'urgence est satisfaite compte tenu de l'imminence de la session des examens du premier semestre, qui doit se dérouler du 3 au 7 janvier 2022 et réunir plus de 150 étudiants dans un même amphithéâtre ; - il est porté une atteinte grave et manifestement illégale au droit à la vie, au droit à la protection de la santé ainsi qu'au principe de précaution, compte tenu du niveau du taux d'incidence atteint à Paris, de l'ordre de 2 000 cas pour 100 000 habitants, de l'obligation faite aux étudiants, y compris s'ils sont positifs ou cas contact, de se rendre à ces examens pour lesquels aucune session de rattrapage n'est prévue, de la certitude que des étudiants contagieux s'y rendront en l'absence de tout contrôle du passe sanitaire et de la configuration de l'amphithéâtre faisant office de salle d'examen, qui est dépourvu de fenêtres et ne permet pas le respect des règles de distanciation ; - le juge des référés du tribunal administratif a entaché son ordonnance d'une erreur de droit, au regard de l'article L. 712-6-1 du code de l'éducation, en jugeant que la présidente de l'université pouvait, sans saisir au préalable la commission de la formation et de la vie universitaire, se substituer à cette commission pour fixer les règles relatives aux examens et, en l'espèce, décider l'organisation d'une session de rattrapage ; - l'ordonnance est également entachée d'irrégularité faute de viser son mémoire en réplique, ainsi que de dénaturation des pièces produites par l'université sur la ventilation des amphithéâtres. II. Sous le n° 460052, par des productions enregistrées le 1er janvier 2022 au secrétariat du contentieux du Conseil d'Etat, M. B reprend les conclusions et les moyens de sa requête enregistrée sous le numéro 460051. 
Vu les autres pièces du dossier ; Vu : - la Constitution ; - la convention européenne de sauvegarde des droits de l'homme et des libertés fondamentales ; - le code de l'éducation ; - l'ordonnance n° 2020-1694 du 24 décembre 2020 ; - le décret n° 2021-699 du 1er juin 2021 ; - le code de justice administrative ; Considérant ce qui suit : 1. Aux termes de l'article L. 521-2 du même code : " Saisi d'une demande en ce sens justifiée par l'urgence, le juge des référés peut ordonner toutes mesures nécessaires à la sauvegarde d'une liberté fondamentale à laquelle une personne morale de droit public ou un organisme de droit privé chargé de la gestion d'un service public aurait porté, dans l'exercice d'un de ses pouvoirs, une atteinte grave et manifestement illégale. () ". En vertu de l'article L. 522-3 du même code, le juge des référés peut, par une ordonnance motivée, rejeter une requête sans instruction ni audience lorsque la condition d'urgence n'est pas remplie ou lorsqu'il apparaît manifeste, au vu de la demande, que celle-ci ne relève pas de la compétence de la juridiction administrative, qu'elle est irrecevable ou qu'elle est mal fondée. 2. M. B, étudiant en " classe préparatoire Talents " rattachée au centre de préparation aux concours de la haute fonction publique de l'université Paris 1 Panthéon-Sorbonne, a demandé au juge des référés du tribunal administratif de Paris, statuant sur le fondement de l'article L. 521-2 du code de justice administrative, de suspendre, dans l'attente de la réunion de la commission de la formation et de la vie universitaire prévue le 11 janvier 2022, la tenue des examens du premier semestre organisés en présentiel entre le 3 et le 7 janvier 2022 par le centre de préparation aux concours de la haute fonction publique ou, à défaut, d'enjoindre à l'université d'organiser ces examens à distance ou, à titre encore subsidiaire, de réexaminer les modalités d'organisation de ces examens. Il relève appel de l'ordonnance du 1er janvier 2022 par laquelle le juge des référés du tribunal administratif de Paris a rejeté sa demande. 3. Les productions enregistrées sous le numéro 460052 constituent un doublon de la requête de M. B enregistrée sous le numéro 460051. Elles doivent par suite être rayées des registres du secrétariat du contentieux du Conseil d'Etat. 4. Il résulte de l'instruction diligentée par le juge des référés du tribunal administratif de Paris que les étudiants participant aux examens organisés entre le 3 et le 7 janvier 2022 par le centre de préparation aux concours de la haute fonction publique de l'université Paris 1 Panthéon-Sorbonne, qui appartiennent à une classe d'âge dont le taux de vaccination est supérieur à 90 %, devront porter le masque pendant toute la durée des épreuves, auront accès à des produits hydro-alcooliques mis à leur disposition et pourront composer dans des conditions permettant le respect des règles de distanciation. 5. Par ailleurs, il résulte également de l'instruction diligentée par le juge des référés du tribunal administratif que la présidente de l'université s'est engagée le 28 décembre 2021 à organiser une session de rattrapage pour les étudiants positifs ou cas contact, soumis à l'isolement. 
En se fondant sur l'ordonnance du 24 décembre 2020 relative à l'organisation des examens et concours pendant la crise sanitaire, dont l'article 4 dispose que les adaptations nécessaires sont arrêtées par le chef d'établissement lorsque l'organe collégial compétent ne peut délibérer dans des délais compatibles avec la continuité du service, pour juger que la présidente de l'université avait pu prendre une telle décision sans attendre la réunion, le 11 janvier 2022, de la commission de la formation et de la vie universitaire, en principe seule compétente en vertu de l'article L. 712-6-1 du code de l'éducation pour adopter les règles relatives aux examens, le juge des référés du tribunal administratif n'a pas commis d'erreur de droit. 6. A l'appui de sa requête devant le juge des référés du Conseil d'Etat, M. B reprend en outre des éléments exposés en première instance dans des mémoires produits après l'audience. Ces mémoires ayant été enregistrés après l'heure de la clôture de l'instruction, le juge des référés du tribunal administratif n'a pas entaché son ordonnance d'irrégularité en les visant comme des notes en délibéré. Si le requérant conteste les allégations de l'université Paris 1 Panthéon-Sorbonne sur l'existence d'un mécanisme de ventilation permettant de renouveler l'air de l'amphithéâtre faisant office de salle d'examen, qui est dépourvu de fenêtres, la circonstance que le dispositif de ventilation ne comporterait pas un système de recyclage et de refroidissement ne saurait en tout état de cause suffire, dans les conditions décrites aux points précédents, à caractériser une atteinte manifestement illégale au droit à la vie et à la protection de la santé. 7. Il résulte de ce qui précède que M. B n'est pas fondé à soutenir que c'est à tort que, par l'ordonnance attaquée, le juge des référés du tribunal administratif de Paris a rejeté sa demande. Ses conclusions d'appel, y compris celles présentées au titre de l'article L. 761-1 du code de justice administrative, doivent par suite être rejetées selon la procédure prévue à l'article L. 522-3 de ce code. O R D O N N E : ------------------ Article 1er : Les productions enregistrées sous le numéro 460052 seront rayées des registres du secrétariat du contentieux du Conseil d'Etat. Article 2 : La requête de M. B est rejetée. Article 3 : La présente ordonnance sera notifiée à M. A B et à l'université Paris 1 Panthéon-Sorbonne. Copie en sera adressée à la ministre de l'enseignement supérieur, de la recherche et de l'innovation. Fait à Paris, le 2 janvier 202 Signé : Suzanne von Coester4600513
for el in root.iter(["Numero_Dossier", "Date_Lecture"]):
# The filter can also be a list of relevant element element tags
print(el.tag)
print(el.text)
Numero_Dossier 460051 Date_Lecture 2022-01-02
Note also that you can navigate between the elements, to jump from elements to their parents, or siblings. This is very helpful if you know the tag of one element but aren't sure of what follows it; or if you want to work on several elements in a row.
for el in root:
    pass # An empty loop to make sure "el" is the last child of root
print("The last child of root is: ", el.tag)
prev_el = el.getprevious() # This method gets you the previous sibling
print(prev_el.tag)
next_el = prev_el.getnext() # And this one the next sibling
print(next_el.tag)
subel = root.getchildren()[1] # Note: getchildren() is deprecated in recent lxml; list(root) does the same
print("The second child from the root is: ", subel)
print("Its parent is", subel.getparent())
root.getchildren()
[<Element Donnees_Techniques at 0x7d334b9bb740>, <Element Dossier at 0x7d334923f7c0>, <Element Decision at 0x7d334923d4c0>]
Now, coming back to our example, we want to get the date for every decision. Note that if we want to do it for one file, we just need to find the relevant element (tag = "Date_Lecture"), and extract the data from that element.
for el in root.iter("Date_Lecture"): # the Date_Lecture element contains the judgment's date;
# Easiest way in XML is to filter all descendants to get only the one we are interesting in
date = el.text
print(date)
Therefore, to obtain it from all judgments, we just need to loop over all files.
files = os.listdir(".")
for file in files[:10]: # Looping only over the first 10
    xml_file = etree.parse(file) # We open each .xml file with the "parse" method
    root = xml_file.getroot() # And we get the root
    for el in root.iter("Date_Lecture"): # the Date_Lecture element contains the judgment's date;
        # we filter all descendants to get only the one we are interested in
        date = el.text
        print(date)
Exercise¶
Adapt the previous algorithm to find the most common "Type_Recours" for all decisions in the folder (use the Counter class from the collections module, demonstrated just below).
from collections import Counter
cc = Counter(["cake", "cake", "cake", "knife", "knif"]) # An example of using a counter
cc.most_common()
# Your code here
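One possible solution - a sketch adapting the date-collection loop above; your version may differ:
recours = Counter()
for file in os.listdir("."): # Assuming we are still in the /content/data/CE folder
    root = etree.parse(file).getroot()
    for el in root.iter("Type_Recours"):
        recours[el.text] += 1
print(recours.most_common(1)) # The most common Type_Recours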
From XML to a DataFrame¶
Now, if we wanted to recreate a full database of all relevant data points in each judgment, we can just use the list-of-lists method. This method leverages the fact that a dataframe is nothing but a list of sublists of equal length, with each sublist being a row (see here for more details).
details = ["Numero_Dossier", "Date_Lecture", "Date_Audience", "Avocat_Requerant", "Type_Decision", "Type_Recours",
           "Formation_Jugement", "Solution"] # All the relevant data points/elements in our judgments
lists_details = [] # Easiest way to create a dataframe is first to have a list of lists,
# and then pass it to pd.DataFrame(lists, columns=details)
for file in files:
    newlist = [] # We create a new, empty sublist every time we switch to a new file;
    # that sublist will be filled with relevant data and added to the main list; each sublist will have the same length
    XML = etree.parse(file)
    root = XML.getroot()
    for detail in details: # For each file, we iterate over each type of detail, using a loop
        result = ""
        for el in root.iter(detail): # and we use this detail to filter from all descendants in root
            result = el.text
        newlist.append(result) # we then pass the result to the sublist created above
    lists_details.append(newlist) # Before the loop concludes with one file and passes on to the next,
    # we append the (filled) newlist to the main list
df = pd.DataFrame(lists_details, columns=details) # Out of the loop, we create a dataframe based on that list of lists
df.head(10)
# df.to_clipboard(index=False) # Finally, we copy the DataFrame so as to paste it (CTRL+V) in Excel
len(df)
df.Type_Recours.value_counts()
Dates and xPath¶
Before turning to scraping, two important points that fit nowhere else. For both, we will use a decision from the Conseil d'Etat.
xPath¶
We saw how to locate an element by filtering all children from the root with the .iter() method. Yet, this is not the easiest way to locate an element when you really need it. Instead, you can use yet another syntax, called xPath. You can read more about xPath here. It works like this:
- You first identify where to find the required element. You typically start from the source element (represented by a dot, "."), then use one slash (/) if you want to search in the immediate children, or two slashes (//) if you need to search the entire tree;
- Then you specify the name of the element, or * if any would do;
- And then you add conditions, in brackets, such as the value of an attribute (introduced by a @), or based on other functions (such as whether the element contains a certain text);
- You can also directly look for all "x" elements (this will return a list of those);
- Finally, xPath comes with a number of functions, such as contains() (which allows you to check that the element contains a certain text).
For instance, if we needed to find the element Date_Lecture in the xml_file, this is what the xPath expression would look like: root.xpath(".//Date_Lecture").
The xpath method returns a list - be careful about this! If you expect only one element, you can immediately index it, as below.
import os
from lxml import etree
file = os.listdir(".")[0] # We take the first file from the CE folder
xml_file = etree.parse(file) # We first open the .xml file with the "parse" method
root = xml_file.getroot() # We then look for the "root" of the XML tree, and pass it to a variable root
numero_dossier = root.xpath(".//Numero_Dossier")[0] # We search for the element Numero Dossier starting from the root
# (which is the "." here)
print("Le numéro de dossier est: ", numero_dossier.text)
paras = root.xpath(".//*[contains(text(), 'Article')]") # Looking for all elements whose text contains the term "Article"
for para in paras: # We can loop since xPath always returns a list!
    print(para.text)
Dates¶
Python has a data format called datetime, which deals with dates. Dates can be text; in some cases, they can be numbers (e.g., a year); but they are most useful when they are of the type "datetime", since they then come with useful methods.
To transform a text into a datetime object, you need to parse it. The datetime module has a function strptime that reads a date according to a pattern. You can look for days, months, years, minutes, etc. For instance, the symbol "%Y" means the full year written as four consecutive digits (e.g., in regex, \d\d\d\d). The full syntax is available here.
Once you have that datetime object, you can act on it, for instance extract the month from its attributes.
from datetime import datetime # The relevant module in the package datetime is also called datetime ...
print(datetime.today()) # datetime knows what date it is today
date = root.xpath(".//Date_Lecture")[0] # We get the date of our decision
print(date.text)
parsed_date = datetime.strptime(date.text, "%Y-%m-%d") # The function strptime allows you to read a text (first argument),
# and if it matches the pattern in second argument, you will create a datetime object (parsed_date here).
print(parsed_date.day) # The day attribute knows the day number (in the month)
But more importantly, datetime objects allow you to reformat a date according to your needs - again, using a pattern.
full_date = parsed_date.strftime("%A %d %B %Y") # Your datetime object can then be transformed (strftime) into a more
# pleasant date format, again using a pattern. Note that datetime knows what day of the week that date was!
print(full_date)
date.set("date", full_date) # Let's add the full date as an attribute to our date element
new_date_el = root.xpath(".//Date_Lecture[@date='" + full_date + "']")[0] # And now we can use xPath to find this element
# with the attribute (which we just added)
print("The element's attribute date: ", new_date_el.get("date"))
Dataframes¶
A lot of the analyses you'll be asked to perform will be based on a dataframe, i.e., a spreadsheet where data is structured in rows and columns.
There are other ways to store, access, and exploit data, but usually even that data is at one point converted into a dataframe, over which you'll perform (and record) your analyses.
The main and most popular module in Python for dataframes is called pandas, and is frequently abbreviated as pd. Think of pandas in this context as an equivalent of Microsoft Excel - except infinitely more flexible (though you can do much more with Excel if you learn VBA, the language powering it).
We'll introduce you very softly to pandas today, keeping most of the heavy work for another lesson - but you'll need the basics to properly follow through the lessons on Scraping, for instance.
We already created a dataframe above, which we can save as a .csv file. Once saved, we can then load the file with pandas, which provides a dedicated read_csv function for this.
import os
import pandas as pd # We import pandas
from matplotlib.pyplot import plot # We also import a module to create graphs and plots
import regex as re
os.chdir("/content/content/data")
df.to_csv("CE.csv", index=False, encoding="utf8")
df = pd.read_csv("CE.csv", header="infer") # We load it
df = df.fillna("") # It is common to fill the "empty" cells (which are cast as "NaN", or "Not a Number"),
# replacing them with an empty string ("") instead.
# This is to facilitate comparisons (as you can't compare a string to a NaN)
Then we can do a bit of data investigation, see what's the most interesting column or data, etc. One first useful tool is the .value_counts() method, which allows you to see the rough distribution of a variable.
df.head(5)
print(df.columns)
print(df["Type_Recours"] ) # You access a particular column by indexing it this way (we'll see further indexing in a few weeks)
df["Type_Recours"].unique() # Functions such as Unique renders a list of all possible values in a given column
df["Type_Recours"].value_counts() # One of the most useful functions returns a count of all values
df["Type_Recours"].value_counts().plot(kind='barh') # And now we can plot it with a bar chart
Like lists and dictionaries, a pandas object is a collection of data, and you may want to index it to find a particular data point. The methods are a bit different however.
To get a single value, you can use the method .at, which takes two arguments: the row index, and the column name.
print(df.at[0, "Type_Recours"]) # This returns the value of the column "Type_Recours" for the first line of the df
print(df.at[20, "Avocat_Requerant"]) # The value for the column "Avocat_Requerant" for the row with index 20
More commonly, pandas objects have a method .loc, which returns a slice of the dataframe based on a condition (which should be True), or .at to get a particular cell if you know its index.
ddf = df.loc[df.Formation_Jugement == "Juge des référés"] # Filtering the dataframe to focus on all rows where the
# formation of judgment is the "Juge des référés"
len(ddf)
Again like with a list, you may want to loop over a dataframe to work on every data point one after another. Iterating over a dataframe is generally done with the help of the method iterrows - which provides you with two elements (the index, and a row that can be indexed with column names).
for index, row in df[:15].iterrows(): # We limit the loop to the first 15 rows
    print(index, row["Formation_Jugement"], row["Avocat_Requerant"])
Keeping track of the index is very useful, if you want to change the value of a column, or populate a new column, as in the example below.
This is what is meant by "enriching" the data, using it to derive further measures or indicators.
df["Empty_Col"] = "" # Creating new column with empty text
df["New_Col"] = df["Formation_Jugement"].str.replace("jugeant seule", "").str.strip() # creating new column with
# data from another column, except we changed all strings (str) with empty text, and then stripping
df["Mixed"] = False # Another new column, with only False datapoints for now
for index, row in df.iterrows(): # We loop over the dataframe
if re.search("r.unies", row["Formation_Jugement"], re.I): # We check that the formation is made of chambres réunies
df.at[index, "Mixed"] = True # If this is the case, we reassign the column Mixed at the relevant index with True
df.Mixed.value_counts()
Finally (for now), note that you can always put your dataframe in the clipboard (i.e., a CTRL+C) so as to paste it (with CTRL+V) into a normal Excel or csv file.
df.to_clipboard(index=False) # Does not work for Colab, though
# But keep it in mind for when you'll use Python on your laptop
Exercise¶
There is a limited group of lawyers who can appear before the CE. Is there a specialisation in this respect between référés and other cases?
# Your code here
# Get a list of référés lawyers
## Filter dataset to focus on référés
## For filtered cases, get names of lawyers
# Get a list of "usual" lawyers