Question
How do I connect R to a Redshift database on Apple M3?
Intro and System
I am trying to access a Redshift database from R. I am on an Apple M3 Pro machine on Sonoma 14.5. I have R version 4.4.1. I am expecting to give the database name (e.g., "dev"), the host URL, the port number, my username, and my password; I am expecting to receive a stable connection where I can query the database.
I have tried using a few packages, described below.
Attempt 1: RPostgres
{RPostgres}
is the most straightforward. I used this post and the official package documentation to help set up.
First, I installed the necessary system software by running brew install libpq
. I also elsewhere saw that brew install postgresql
may help, too, so I have those installed.
I installed the package using install.packages("RPostgres")
as well as installing it from source using remotes::install_github("r-dbi/RPostgres")
or install.packages("RPostgres", type = "source")
. I then ran:
con <- dbConnect(
RPostgres::Redshift(),
dbname = "dev",
host = Sys.getenv("RS_URLS"),
port = 5439,
user = Sys.getenv("RS_USER"),
password = Sys.getenv("RS_PASS"),
sslmode = "require"
)
Where the environmental variables are the Redshift URL, username, and password for my database. No matter what I tried, I keep getting the error:
Error: connection to server at "<host>" (<ip>), port 5439 failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
This appears to be a known and open issue. It doesn't matter if I installed from CRAN or from source—I get the same error.
Attempt 2: RPostgreSQL
{RPostgreSQL}
says it will also use libpq
in the Description
to the package on CRAN. Using the syntax from the example in the documentation:
drv <- dbDriver("PostgreSQL")
con <- dbConnect(
drv,
dbname = "dev",
host = Sys.getenv("RS_URLS"),
port = 5439,
user = Sys.getenv("RS_USER"),
password = Sys.getenv("RS_PASS")
)
Throws the error:
Error in postgresqlNewConnection(drv, ...) :
RPosgreSQL error: could not connect <user>@<host>:5439 on dbname "dev": FATAL: no pg_hba.conf entry for host "???", user "<user>", database "dev", SSL off
Note on SSL that I have successfully hooked Redshift up to a database visualizer on my computer, and it required SSL to be on. Candidy, I have no clue what SSL is or means, but there isn't an option for it here (while there is in {RPostgres}
but it still failed with that flagged).
I made some progress here by discovering this post. The OP was adding SSL information to their database name. I can do the same:
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(
drv,
dbname = "'dbname=dev sll=require'",
host = Sys.getenv("RS_URLS"),
port = 5439,
user = Sys.getenv("RS_USER"),
password = Sys.getenv("RS_PASS")
)
Interestingly, this gets me to no error but an aborted R session (much in common with Attempt 4 using RJDBC below).
Attempt 3: odbc
Posit (fka RStudio) officially recommends using ODBC via the {odbc}
package. There is comprehensive documentation:
- Package vignette for setting up
- Setting up a driver
- Amazon guide to downloading their ODBC Redshift driver
- Format for connecting to Redshift
Starting with the vignette, I ran brew install unixodbc
as well as brew install psqlodbc
. However, I don't think that's the driver to use since I want to use Redshift. So I followed the instructions on that Amazon guide for downloading, installing, and formatting the .ini
files for the driver.
Per Amazon's recommendation, I took the odbcinst.ini
file from the Setup
subdirectory from the install and copied it to my home directory, prepended with .
to keep it hidden. When I cat
that file, it reads:
[ODBC Drivers]
Amazon Redshift=Installed
[Amazon Redshift]
Description=Amazon Redshift ODBC Driver
Driver=/opt/amazon/redshift/lib/libamazonredshiftodbc.dylib
The issue is when I run odbcinst -j
in the terminal—per the setup vignette—I do not see the correct paths:
unixODBC 2.3.12
DRIVERS............: /usr/local/etc/odbcinst.ini
SYSTEM DATA SOURCES: /usr/local/etc/odbc.ini
FILE DATA SOURCES..: /usr/local/etc/ODBCDataSources
USER DATA SOURCES..: /Users/<username>/.odbc.ini
SQLULEN Size.......: 8
SQLLEN Size........: 8
SQLSETPOSIROW Size.: 8
And if I cat
the first two .ini
files, they're empty. The user data source .ini
simply does not exist.
After reading this GitHub issue, I realized I had to change my .Renviron
file to have it look for the right directory to get the correct .ini
file. I updated that to: ODBCSYSINI="/Users/<username>"
since Amazon suggested I move the .ini
to my home directory (since it will re-write the file if one gets a new download or reinstalls the driver).
So now, into R.
If I have the {odbc}
package list the drivers, it reads the correct path where I know the driver exists:
> odbc::odbcListDrivers()
name attribute value
1 ODBC Drivers Amazon Redshift Installed
2 Amazon Redshift Description Amazon Redshift ODBC Driver
3 Amazon Redshift Driver /opt/amazon/redshift/lib/libamazonredshiftodbc.dylib
I've navigated there repeatedly and know that driver is there. So when I use the suggested Redshift format, I enter:
con <- dbConnect(
odbc(),
Driver = "Amazon Redshift",
servername = Sys.getenv("RS_URLS"),
database = "dev",
UID = Sys.getenv("RS_USER"),
PWD = Sys.getenv("RS_PASS"),
Port = 5439
)
And I receive the error:
Error: nanodbc/nanodbc.cpp:1138: 00000%0D%0A[Amazon][ODBC] (11560) Unable to locate SQLGetPrivateProfileString function: [Amazon][Support] (50483) Could not load shared library, all attempted paths ("") failed
Googling around, I actually ended up at another Posit document—albeit for a different database—that was aimed at troubleshooting for Mac users. It suggests finding where libodbcinst.dylib
is and adding it to the configuration file for the driver. For me, that is at /opt/amazon/redshift/lib/
and it's called amazon.redshiftodbc.ini
. Per the link above, I added the location to libodbcinst.dylib
and another line at the bottom of it:
ODBCInstLib=/usr/local/Cellar/unixodbc/2.3.12/lib/libodbcinst.dylib
DriverManagerEncoding=UTF-16
There's also another location for this file elsewhere in /usr/local/lib/
but it points to that Cellar
path:
libodbcinst.dylib -> ../Cellar/unixodbc/2.3.12/lib/libodbcinst.dylib
Regardless of what path I put in the configuration file, I get the same error, but now it adds that path to the error message:
> con <- dbConnect(
+ odbc(),
+ Driver = "Amazon Redshift",
+ servername = Sys.getenv("RS_URLS"),
+ database = "dev",
+ UID = Sys.getenv("RS_USER"),
+ PWD = Sys.getenv("RS_PASS"),
+ Port = 5439
+ )
Error: nanodbc/nanodbc.cpp:1138: 00000
[Amazon][ODBC] (11560) Unable to locate SQLGetPrivateProfileString function: [Amazon][Support] (50483) Could not load shared library, all attempted paths ("", "/usr/local/Cellar/unixodbc/2.3.12/lib/libodbcinst.dylib") failed
>
I spent a lot of time fiddling around with .ini
files and where they point, etc., and can't get anything beyond that error.
My lingering worry here is that the Amazon download page says:
On Linux and macOS X operating systems, you use an ODBC driver manager to configure the ODBC connection settings. ODBC driver managers use configuration files to define and configure ODBC data sources and drivers. The ODBC driver manager that you use depends on the operating system that you use:
unixODBC driver manager (for Linux operating systems)
iODBC driver manager (for macOS X operating system)
However, the setup vignette says:
For Unix and MacOS, ODBC drivers should be compiled against unixODBC. Drivers compiled against iODBC may also work, but are not fully supported.
I'm not totally sure—it very well could be another problem with misspecified paths, despite my best efforts.
Attempt 4: RJDBC
{RJDBC}
is last, because I dislike depending on Java. However, the only "official" guide from Amazon on hooking up to R uses this package. Unfortunately, it was published in 2015 and has me download what is probably an out-of-date driver.
I run this code, from the example:
install.packages("RJDBC")
library(RJDBC)
# download Amazon Redshift JDBC driver
download.file(
'http://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC41-1.1.9.1009.jar',
'RedshiftJDBC41-1.1.9.1009.jar'
)
# connect to Amazon Redshift
driver <- JDBC(
"com.amazon.redshift.jdbc41.Driver",
"RedshiftJDBC41-1.1.9.1009.jar",
identifier.quote = "`"
)
And R immediately aborts the session. I found another guide for this package from 2017, and I run the suggested code therein... and R once again aborts the session when I get to the JDBC()
call reading in the driver.
What if I don't use an older version of the driver? I download it directly from this Amazon link. I then set up the code like:
library(RJDBC)
driver <- JDBC(
"com.amazon.redshift.jdbc42.Driver",
"redshift-jdbc42-2.1.0.29.jar"
)
url <- sprintf(
"jdbc:redshift://%s:%s/%s?tcpKeepAlive=true&ssl=true&sslfactory=com.amazon.redshift.ssl.NonValidatingFactory",
Sys.getenv("RS_URLS"),
5439,
"dev"
)
jconn <- dbConnect(driver, url, Sys.getenv("RS_USER"), Sys.getenv("RS_PASS"))
Where I ticked the driver class up to 42 from 41. This time, I don't get RStudio aborting the R session and exploding until I run dbConnect()
. I also tried getting a new name for the driver class using findDrivers()
as is suggested in the documentation—but encountered the same aborted session upon running dbConnect()
. I also tried including the identifier.quote
parameter here—same outcome.
Summary
How do I connect R to Redshift?
(And why the hell is it this hard, when packages like {bigrquery}
make it trivially simple for other databases?)