Natural Language Processing has been an exciting buzzword for a while now. After hearing about it for years, I finally got to use it in a recent project that required extracting named entities from a large number of news articles. The project was a serverless Extract-Transform-Delivery pipeline hosted in AWS, with the various stages of the pipeline executed in lambda functions. In one of the stages, a text blob was given to spaCy, which processed it using its pre-trained model for the English language. At the end of the process, all the entities found in the input text, along with a categorisation of each entity as person, organisation, geo-location, event and so on, are output in a JSON document, which is then uploaded into Elasticsearch. A section from the JSON output might look like this:
{
...
"ne_org": ["Indian Shipyards", "INS Arihant", "INS Aridhaman", "New Delhi Vayu Aerospace Review"],
"ne_gpe": ["SBC", "Visakhapatnam", "India"],
"ne_person": "",
"ne_norp": "Indian",
"ne_loc": "",
"ne_language": "English",
"ne_fac": ["MoD", "Ship Building Centre"],
"ne_work": "",
"ne_product": "",
"ne_event": "Wake of Series of Mishaps",
"ne_law": ""
...
}
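In essence, that stage groups doc.ents by entity label into the fields shown above. A minimal sketch of how it might look (the function name and the exact field mapping are my reconstruction, not the project's actual code):
# Map spaCy entity labels to the JSON field names shown above.
LABEL_TO_FIELD = {
    'ORG': 'ne_org', 'GPE': 'ne_gpe', 'PERSON': 'ne_person',
    'NORP': 'ne_norp', 'LOC': 'ne_loc', 'LANGUAGE': 'ne_language',
    'FAC': 'ne_fac', 'WORK_OF_ART': 'ne_work', 'PRODUCT': 'ne_product',
    'EVENT': 'ne_event', 'LAW': 'ne_law',
}

def extract_entities(nlp, blob):
    doc = nlp(blob)
    entities = {field: set() for field in LABEL_TO_FIELD.values()}
    for ent in doc.ents:
        field = LABEL_TO_FIELD.get(ent.label_)
        if field:
            entities[field].add(ent.text)
    # deduplicate repeat mentions and sort for stable JSON output
    return {field: sorted(values) for field, values in entities.items()}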
spaCy is big
The first challenge was to fit it into a lambda. Lambdas have limited memory and a limited deployment package size: per the AWS documentation, a function package can be at most 50MB zipped and 250MB unzipped.
There is some leeway on these limits, though. The 50MB cap on the zipped size only applies to direct uploads. If the zip is imported from S3 instead, only the unzipped size matters: as long as the unzipped size is under 250MB, it is possible to import zip files bigger than 50MB.
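For instance, a layer can be published straight from a zip staged in S3. A minimal boto3 sketch (the bucket and key names are hypothetical):
import boto3

lambda_client = boto3.client('lambda')
# Publishing from a zip staged in S3 sidesteps the 50MB direct-upload
# limit; only the 250MB unzipped limit applies.
lambda_client.publish_layer_version(
    LayerName='spacy',
    Content={'S3Bucket': 'my-deploy-bucket', 'S3Key': 'pack.zip'},
    CompatibleRuntimes=['python3.7'],
)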
Thankfully, someone had already packaged spaCy for an older version of Python (2.7). Following this git repo, I created our own bootstrap script for Python 3.7 and the latest spaCy version. This is the standard approach to packaging a lambda function or layer: install all the dependencies in a virtual environment, remove unnecessary bits (optional) and upload the zip straight to lambda or to an S3 bucket. We used the same approach to package lxml, pdfminer, Beautiful Soup and a few other libraries. Although the actual spaCy model is loaded separately at runtime, the lemmas and the grammatical rules and exceptions for each supported language are part of the spaCy library itself. I had to get rid of some language data, as the rulesets and lemmas have grown considerably since earlier versions.
#!/bin/bash
# Build a spaCy lambda layer: install into a virtualenv, copy the
# site-packages, strip unneeded files, then zip and publish.
yum -y update
yum -y upgrade
yum install -y gcc
yum install -y gcc-c++
yum install -y python3
alias python=python3
yum install -y python3-devel
pip3 install awscli --upgrade
pip3 install virtualenv
yum install -y zlib-devel
cd /home/
mkdir envs
chmod 777 envs
mkdir -p /home/lambdapack/python/lib/python3.7/site-packages
find /home/lambdapack -type d -exec chmod 777 {} \;
cd /home/envs
python3 -m virtualenv env --python=python3
source env/bin/activate
pip install numpy -U
pip install spacy -U
cd /home/lambdapack/python/lib/python3.7/site-packages
cp -R /home/envs/env/lib/python3.7/site-packages/* .
cp -R /home/envs/env/lib64/python3.7/site-packages/* .
find /home/lambdapack -type f -exec chmod 777 {} \;
# cleaning
find . -type d -name "tests" -exec rm -rf {} +
find -name "*.so" | xargs strip
find -name "*.so.*" | xargs strip
rm -r pip
rm -r pip-*
rm -r wheel
rm -r wheel-*
rm easy_install.py
rm -rf spacy/lang/nb
rm -rf spacy/lang/sv
find . -name __pycache__ -type d -exec rm -rf {} \;
find . -name "*.dist-info" -type d -exec rm -rf {} \;
find . -name \*.pyc -delete
echo "stripped size $(du -sh /home/lambdapack | cut -f1)"
# compressing
cd /home/lambdapack
zip -r ../pack.zip *
echo "compressed size $(du -sh /home/pack.zip | cut -f1)"
# publish layer version
cd /home
aws lambda publish-layer-version --layer-name spacy --zip-file fileb://pack.zip --compatible-runtimes python3.7 --region eu-west-2
In the last line, the generated zip is used to create a lambda layer. After that, doing Natural Language Processing is as simple as including this layer in the lambda and treating spaCy like any other python library. There is one more little space-saving trick, described below. A sample piece of code using spaCy would look like this:
import spacy

nlp = spacy.load('en_core_web_sm-2.0.0')
doc = nlp(spacy_input)
# iterate over the named entities recognised in the text
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.label_)
This expects the model to be present in the working directory. But as package space is limited in a lambda, I took advantage of the fact that the model is a separate component from the library, and uploaded the model to an S3 bucket. Before initialising spaCy, I download the model from S3. This is accomplished by the method below.
import os
import boto3

def download_dir(dist, local, bucket):
    # get_boto3_client: helper (defined elsewhere) that caches boto3
    # clients between lambda invocations
    client = get_boto3_client('s3', lambda n: boto3.client('s3'))
    resource = get_boto3_client('s3r', lambda n: boto3.resource('s3'))
    paginator = client.get_paginator('list_objects')
    for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
        # recurse into "subdirectories" under the prefix
        if result.get('CommonPrefixes') is not None:
            for subdir in result.get('CommonPrefixes'):
                download_dir(subdir.get('Prefix'), local, bucket)
        if result.get('Contents') is not None:
            for file in result.get('Contents'):
                dest_path = local + os.sep + file.get('Key')
                if not os.path.exists(os.path.dirname(dest_path)):
                    os.makedirs(os.path.dirname(dest_path))
                if not dest_path.endswith('/'):
                    resource.meta.client.download_file(bucket, file.get('Key'), dest_path)
And the code using spaCy looks like this:
import os
import spacy

# 'lang' holds the model's S3 prefix and 'mapping_bucket' the bucket name
if not os.path.isdir('/tmp/en_core_web_sm-2.0.0'):
    download_dir(lang, '/tmp', mapping_bucket)
spacy.util.set_data_path('/tmp')
nlp = spacy.load('/tmp/en_core_web_sm-2.0.0')
doc = nlp(spacy_input)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.label_)
An example execution for the text “Modi campaigns ahead of Lok Sabha election in India” would produce the output below:
Modi 0 PERSON
Lok Sabha 24 ORG
election 34 EVENT
India 46 GPE
So, how good is spaCy?
You generally get good results, except for the occasional miscategorisation of an entity. A lot depends on how well written the text blob is: if the input is standard English prose describing events, spaCy does well. There are some issues, though. It does not clean up strings that start with a non-alphabetic character, so something like “Harrison (leading quotation mark included) can come up as a valid person entity. This can be rectified with checks in the code that populates the entity dictionary. Another problem is different versions of the same person’s name. For example, Theresa May might be spelled as Teresa May, and both spellings might appear in the same document; spaCy treats both as valid entities. We got around this by keeping a mapping of incorrect to correct spellings in DynamoDB and running checks against it. Sometimes there are simply different versions of the same entity: Vladimir Putin and Putin, say, are treated as valid and distinct entities. It is possible to have logic that always replaces a substring with its superstring in the final entity dictionary, as sketched below. This can lead to data loss if an article mentions both Donald Trump and Donald Trump Jr, but the heuristic is still practical, as you get much cleaner entity sets at a comparatively small price. If the text is written in a more passive, formulaic register, results can be more unwieldy. For example, when we tried this on UN General Assembly resolutions, words like Confirmed, Requests, Recognizing, Welcoming, Reiterates and Expressing were categorised as people’s names. A possible improvement could be to use a larger model, as we only used en_core_web_sm, which is the smallest English model.
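To illustrate the substring-to-superstring merge (a hypothetical sketch, not our production code):
def merge_substring_entities(names):
    # Collapse entities that are substrings of a longer entity in the same
    # set: 'Putin' folds into 'Vladimir Putin'. Note the data-loss risk
    # mentioned above: 'Donald Trump' would be absorbed by 'Donald Trump Jr'.
    merged = []
    for name in sorted(set(names), key=len, reverse=True):  # longest first
        if not any(name in kept for kept in merged):
            merged.append(name)
    return merged

print(merge_substring_entities(['Putin', 'Vladimir Putin', 'Donald Trump']))
# ['Vladimir Putin', 'Donald Trump']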
Is that all?
I later extended the lambda to create a summary of each article using spaCy-generated lemmas. The code followed the approach discussed here. As specified in the description at the top of that file, in step 6 the sentences are ranked. But as this depends on having well-formed sentences, it is also sensitive to how well punctuated the text is. I ended up with some summaries as long as 10 lines on a standard A4 page.
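In spirit, that approach is frequency-based extractive summarisation: score each sentence by the frequencies of its lemmatised content words and keep the top-ranked ones. A compressed sketch of the idea, with an explicit sentence cap that would have reined in the page-long summaries (my illustration, not the linked code):
from collections import Counter

def summarise(nlp, text, max_sentences=3):
    # Score sentences by the frequency of their lemmatised content words;
    # needs a model with a parser for sentence boundaries.
    doc = nlp(text)
    freqs = Counter(
        tok.lemma_.lower() for tok in doc
        if tok.is_alpha and not tok.is_stop
    )
    ranked = sorted(
        doc.sents,
        key=lambda sent: sum(freqs[t.lemma_.lower()] for t in sent if t.is_alpha),
        reverse=True,
    )
    top = ranked[:max_sentences]
    top.sort(key=lambda sent: sent.start)  # restore document order
    return ' '.join(sent.text.strip() for sent in top)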
As we had duplicate documents coming in from various sources with small variations in content, we decided to try spaCy’s similarity score to identify which articles were duplicates of each other, always limiting the comparison to articles published on the same day. For checking the similarity between text1 and text2, all you have to do is nlp(text1).similarity(nlp(text2)).
The results were very misleading. Article pairs scoring just below 99% could be vastly different, even completely unrelated. It seemed like the similarity percentage was based on the type of events being mentioned, completely ignoring the people, locations and times involved. So it might be more useful for finding plagiarism among documents than for the kind of near-duplicate detection we were after.
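For reference, a same-day duplicate sweep built on that one-liner might look like the sketch below (the function name and the 99% threshold are illustrative). One caveat worth noting: if the similarity check also ran on en_core_web_sm, the small models ship without real word vectors, and spaCy documents their similarity judgments as rough approximations, which may partly explain these results.
from itertools import combinations

def find_duplicate_pairs(nlp, articles, threshold=0.99):
    # articles: {article_id: text}, all published on the same day
    docs = {article_id: nlp(text) for article_id, text in articles.items()}
    return [
        (a, b)
        for a, b in combinations(docs, 2)
        if docs[a].similarity(docs[b]) >= threshold
    ]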