Elasticsearch - Defining the mapping of Twitter Data

Christina Boididou
6 min read · Aug 4, 2018


Lately, I started working with Elasticsearch with the intention to load and query some Twitter data. In this post, I want to share how I defined the mapping for this particular data and what challenges I faced. Since Twitter is a very popular source of data, having the process of defining the mapping summarised here could be useful to anyone who wants to work with it.

Exploring the dynamic mapping

If you have ever worked with Elasticsearch, you already know that the first step of using it is to create an index where the documents live. This index also holds information about the type of each of a document’s fields, which is defined in the mapping. The Elasticsearch docs define mapping as:

Mapping is the process of defining how a document, and the fields it contains, are stored and indexed.

However, thanks to dynamic mapping, Elasticsearch offers the flexibility to automatically add a field and infer its type just by indexing a document. Given this, I first tried to index a document without defining the mapping and looked into what Elasticsearch came up with for the field types.

Each tweet in Twitter data is a JSON document with several fields referring to the actual tweet and the Twitter user as well. One can find more about Twitter’s data format in the Twitter docs.

So, I indexed the document into the twitter index, under a _doc document type with the custom id of 1 using the Index API as follows:

PUT twitter/_doc/1
{
  "text" : "The text of the tweet",
  "created_at" : "Thu Jul 31 23:00:09 +0000 2014",
  "other_field": "",
  .....
}

After this, I checked what dynamic mapping had created:

GET /twitter/_mapping

…and the result was (note that most of the fields have been removed for display purposes; the whole mapping can be found in this gist):

{
  "twitter": {
    "mappings": {
      "_doc": {
        "properties": {
          "coordinates": {
            "properties": {
              "coordinates": {
                "type": "float"
              },
              "type": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          },
          "created_at": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "favorite_count": {
            "type": "long"
          },
          "geo": {
            "properties": {
              "coordinates": {
                "type": "float"
              },
              "type": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          },
          "id": {
            "type": "long"
          },
          "place": {
            "properties": {
              "attributes": {
                "type": "object"
              },
              "bounding_box": {
                "properties": {
                  "coordinates": {
                    "type": "float"
                  },
                  "type": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  }
                }
              },
              "country": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "full_name": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "name": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          },
          "text": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "user": {
            "properties": {
              "contributors_enabled": {
                "type": "boolean"
              },
              "created_at": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "description": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "favourites_count": {
                "type": "long"
              },
              "followers_count": {
                "type": "long"
              },
              "friends_count": {
                "type": "long"
              },
              "geo_enabled": {
                "type": "boolean"
              },
              "listed_count": {
                "type": "long"
              },
              "location": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "name": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "screen_name": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "statuses_count": {
                "type": "long"
              }
            }
          }
        }
      }
    }
  }
}

The first interesting thing is that Elasticsearch creates two versions of each string field: a text one and a keyword one. The former is analyzed by the chosen analyzer at index time, while the latter is stored as is, which is really useful for aggregations and grouping. This saves us a lot of time, as in previous Elasticsearch versions we had to define the keyword type for every single field ourselves.
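For instance, to group tweets by author, one can aggregate on the keyword sub-field rather than the analyzed text field. A sketch of such a request against the index above (the aggregation name top_users and the size are arbitrary choices):

GET twitter/_search
{
  "size": 0,
  "aggs": {
    "top_users": {
      "terms": {
        "field": "user.screen_name.keyword",
        "size": 10
      }
    }
  }
}

Aggregating on user.screen_name directly would fail (or require enabling fielddata), since text fields are analyzed and not suited for aggregations.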

Defining the mapping

Although Elasticsearch did a good job identifying the types of the fields, there were some cases where a specific definition had to be given. Dynamic mapping is good when we want to index data quickly; however, the mapping of existing fields cannot be updated. Thus, we should carefully choose the types of specific fields before indexing, to avoid having to re-index every time we want to update our mapping.

Going back to the mapping, I noticed that the created_at field (along with the retweeted_status.created_at, user.created_at and retweeted_status.user.created_at fields) is indexed as a string field, while it would be really useful to store these with the date type in case we want to perform a date range query.

An example of a tweet’s date is:

"created_at": "Thu Jul 31 23:00:09 +0000 2014"

According to the date format pattern syntax, the tweet’s date format is as follows:

"EEE MMM dd HH:mm:ss Z yyyy"

Given this, I defined the mapping for the above fields as:

PUT twitter
{
  "mappings": {
    "_doc": {
      "properties": {
        "created_at": {
          "type": "date",
          "format": "EEE MMM dd HH:mm:ss Z yyyy"
        },
        "retweeted_status.created_at": {
          "type": "date",
          "format": "EEE MMM dd HH:mm:ss Z yyyy"
        },
        "user.created_at": {
          "type": "date",
          "format": "EEE MMM dd HH:mm:ss Z yyyy"
        },
        "retweeted_status.user.created_at": {
          "type": "date",
          "format": "EEE MMM dd HH:mm:ss Z yyyy"
        }
      }
    }
  }
}
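With created_at mapped as a date, a range query becomes straightforward. A sketch of fetching tweets posted within a given window (the dates here are purely illustrative, written in the same format defined in the mapping):

GET twitter/_search
{
  "query": {
    "range": {
      "created_at": {
        "gte": "Mon Jul 28 00:00:00 +0000 2014",
        "lte": "Thu Jul 31 23:59:59 +0000 2014"
      }
    }
  }
}

Had created_at stayed a text field, such range comparisons would be lexicographic on the analyzed terms rather than chronological.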

Other fields that are not automatically indexed as they should be are the geo fields (they are indexed as arrays of floats). Twitter provides two fields that hold the same information in different formats: the coordinates.coordinates and the geo.coordinates nested fields. I decided to use the former, as it matches Elasticsearch’s way of representing a geo point (an array of [lon, lat]) and also complies with the GeoJSON format.

So, the resulting mapping:

PUT twitter
{
  "mappings": {
    "_doc": {
      "properties": {
        "coordinates.coordinates": {
          "type": "geo_point"
        }
      }
    }
  }
}
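Once coordinates.coordinates is a geo_point, geo queries become available. For example, a geo_distance query finding tweets posted within 10 km of a point (the location and radius are arbitrary examples):

GET twitter/_search
{
  "query": {
    "geo_distance": {
      "distance": "10km",
      "coordinates.coordinates": {
        "lat": 55.86,
        "lon": -4.25
      }
    }
  }
}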

Last was the bounding_box field, which encloses the coordinates of a Twitter place. This is a nested field that contains a Polygon type and the actual coordinates. This one was tricky. The obvious way is to handle it as a geo_shape type; however, Elasticsearch expects the first and the last point of a polygon to match, in order for it to be closed. Twitter polygons are not closed though, so this threw an error when indexing documents.

The solution to this was to set "coerce": true in the mapping, a parameter that automatically closes unclosed polygons. Even with coerce, I had cases in the data where the polygon consisted of four identical geo points, similar to this:

"bounding_box": {
  "type": "Polygon",
  "coordinates": [
    [
      [-4.43193, 55.864109],
      [-4.43193, 55.864109],
      [-4.43193, 55.864109],
      [-4.43193, 55.864109]
    ]
  ]
}

The above was throwing an exception at indexing time. Setting "ignore_malformed": true solved the problem and allowed the documents to be indexed (the malformed shape itself is simply skipped):

"place.bounding_box": {
  "type": "geo_shape",
  "coerce": true,
  "ignore_malformed": true
}

After all these considerations, I ended up with the following mapping for the Twitter data (please find the whole custom mapping in this gist):

PUT twitter
{
  "mappings": {
    "_doc": {
      "properties": {
        "created_at": {
          "type": "date",
          "format": "EEE MMM dd HH:mm:ss Z yyyy"
        },
        "retweeted_status.created_at": {
          "type": "date",
          "format": "EEE MMM dd HH:mm:ss Z yyyy"
        },
        "user.created_at": {
          "type": "date",
          "format": "EEE MMM dd HH:mm:ss Z yyyy"
        },
        "retweeted_status.user.created_at": {
          "type": "date",
          "format": "EEE MMM dd HH:mm:ss Z yyyy"
        },
        "coordinates.coordinates": {
          "type": "geo_point"
        },
        "place.bounding_box": {
          "type": "geo_shape",
          "coerce": true,
          "ignore_malformed": true
        }
      }
    }
  }
}
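As a quick sanity check of the geo_shape field, one can query for tweets whose place bounding box intersects a given area. A sketch using an envelope shape (the corner coordinates are illustrative; an envelope is given as [top-left, bottom-right] in [lon, lat] order):

GET twitter/_search
{
  "query": {
    "geo_shape": {
      "place.bounding_box": {
        "shape": {
          "type": "envelope",
          "coordinates": [[-5.0, 56.0], [-4.0, 55.0]]
        },
        "relation": "intersects"
      }
    }
  }
}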

Wrapping up

In this article, I have presented the process of defining the mapping for Twitter data, looking into its basic field types. Of course, this is not the only way to define it. Depending on a project’s specific needs, many different mappings can be defined, also taking advantage of the wide range of text analyzers and other built-in capabilities that Elasticsearch offers.

Please share in the comments your way of creating a Twitter mapping or just your suggestions and thoughts.


Christina Boididou

Data enthusiast, Machine Learning fan. Doing data work at the BBC