Puppet:Out of Range for Type Integer

This week we ran into a rather small (or is it large) problem with our puppet instance. We logged into the puppet console and noticed that there were over 37,000 pending tasks, and the list was growing fast. Checking the logs, we saw an "out of range" exception. An out of range exception for an enterprise product is never a good thing. It’s almost as bad as a segmentation fault in an enterprise product, something you can do nothing about if you don’t have access to the source code. In this case though, we actually can do something about this particular issue.

Here’s the exact error we were seeing…​

2015-04-17T22:30:15+0000: [Worker(delayed_job.7 host:http://foosite.com pid:17446)] Class#create_from_yaml failed with ActiveRecord::StatementInvalid: PG::Error: ERROR: value "2147716789" is out of range for type integer: INSERT INTO "resource_events" ("audited", "desired_value", "historical_value", "message", "name", "previous_value", "property", "resource_status_id", "status", "time") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10) RETURNING "id" - 2 failed attempts 2015-04-17T22:30:15+0000: [Worker(delayed_job.7 host:http://foosite.com pid:17446)] PERMANENTLY removing Class#create_from_yaml because of 3 consecutive failures.
2015-04-17T22:30:15+0000: [Worker(delayed_job.7 host:http://foosite.com pid:17446)] 1 jobs processed at 0.3037 j/s, 1 failed ...
2015-04-17T22:30:15+0000: [Worker(delayed_job.2 host:http://foosite.com pid:17361)] Class#create_from_yaml failed with
ActiveRecord::StatementInvalid: PG::Error: ERROR: value "2147716814" is out of range for type integer

Solution

It turns out that the functionality that uses this is deprecated as of early 2014, so this supposedly isn’t an issue with newer puppet installs. However, if you’re using an older puppet (3.0 or older), you might run into this problem.

The problem lies in the database schema for the puppet console. Basically, every time a node checks in, it inserts a row into the database. The database has some tables with columns that auto-increment (0, 1, 2, 3, etc). If you have a lot of nodes reporting back frequently, this number will likely increase a lot over time. In our case, we have 333 nodes reporting every 30 minutes or more (we do development and thus we often manually run puppet agent with the -t switch). In our case, to hit 37,000, it would have taken a little over 2 days (30*(24*60)*333 = 1 day’s checkin count)

The columns that autoincrement use the int datatype. This datatype, as seen here, uses 4 bytes. In case anyone doesn’t remember, there are 8 bits in a byte, which means that 4 * 8 = 32. That means that the maximum number that will fit in any column with the int data type is 2(32-1), which equals 2,147,483,648. That means 2 billion puppet reports. It seems like a number not easy to achieve, but it is quite possible - we did it.

The solution here is to change the data type on the columns in concern to be bigint rather than integer. Again, as documented by the postgres folks here, a bigint is 8 bytes, which is a 64 bit number. That means the largest it can hold is 9,223,372,036,854,775,807 (about 9 quintillion). That said, let’s get to it.

Executing the Fix

Before performing the fix, we should probably perform a backup of the database, unless you’re the type who likes causing a fire you have to put out on a Friday, just like…​

Compliments of Gimp

To execute a backup (we’ll assume your database name is console), run

pgsql_dump -U console -W -h localhost console > console.$(date '+%Y%d%m.%H%M').sql

Once that backup is complete (mine was 86 GB, so it took a while), shut down all of your puppet services to be safe. A list of services you might want to shut down can be found here. A general rule of thumb though is, anything in /etc/init.d/ that starts with pe- is something that should be stopped, excepting pe-postgresql.

Once that’s done, execute this fun console trick.

$ psql -U console -W -h localhot

-- Change to the console table console=> \c console

-- This one might take a *very* long time (mine took an hour) console=>
alter table resource_statuses alter column id type bigint; console=>
alter table resource_events alter column id type bigint; console=> alter
table resource_events alter column resource_status_id type bigint
console=> \q

With that, restart the pe-postgresql service for good measure. Once that’s done restarting, start up the other pe-* services and everything should be working now.

This is a bug that was reported about three years ago. They have since migrated ticket tracking systems, so the links can be found at…​