sqldf: query data by range of dates
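I am reading a file that has dates in %d/%m/%Y format, using read.csv.sql from sqldf.
Running a command like the one below returns a data.frame with 0 rows: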
    first_date <- "2001-11-1"
    second_date <- "2003-11-1"
    query <- "select * from file WHERE strftime('%d/%m/%Y', Date, 'unixepoch', 'localtime') between
               '$first_date' AND '$second_date'"
    df <- read.csv.sql(data_file, sql = query, stringsAsFactors = FALSE, sep = ";", header = TRUE)
So, to reproduce the problem on a small example, I tried using sqldf as follows:
    first_date <- "2001-11-1"
    second_date <- "2003-11-1"
    df2 <- data.frame( Date = paste(rep(1:3, each = 4), 11:12, 2001:2012, sep = "/"))
    sqldf("SELECT * FROM df2 WHERE strftime('%d/%m/%Y', Date, 'unixepoch') BETWEEN '$first_date' AND '$second_date'")
    # Expect:
    #        Date
    # 1 1-11-2001
    # 2 1-12-2002
    # 3 1-11-2003
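strftime: strftime with percent codes formats a value that SQLite already treats as a date/time into a string. The question needs the reverse (parsing a string into a date), so the approach in the question cannot work. For example, this formats the current time as a dd-mm-yyyy string: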
    library(sqldf)
    sqldf("select strftime('%d-%m-%Y', 'now') now")
    ##          now
    ## 1 07-09-2014
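To make the direction of strftime concrete, here is a minimal sketch, assuming sqldf with its default SQLite backend. The column names iso_in and dmy_in are made up for this illustration. SQLite's date functions understand ISO-8601 style strings such as yyyy-mm-dd and return NULL for time strings they cannot interpret, so the d/m/yyyy input should come back as NA in R:

```r
library(sqldf)

# strftime() reformats a value SQLite already recognises as a date/time.
# '2001-11-01' is an ISO-8601 string, so it can be reformatted.
# '1/11/2001' is not a recognised time string, so SQLite returns NULL.
res <- sqldf("select strftime('%d/%m/%Y', '2001-11-01') as iso_in,
                     strftime('%d/%m/%Y', '1/11/2001')  as dmy_in")
print(res)
```

This is why the comparison in the question never matches anything: the left-hand side of BETWEEN is not a usable date string for the rows in the file.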
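Discussion: Because SQLite has no date type, handling this is somewhat awkward, particularly with a non-standard date format that uses 1 or 2 digits for day and month. If you really want to use SQLite, it can be done by tediously parsing the date strings. Using fn$ from the gsubfn package for string interpolation makes this a little easier: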
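    library(sqldf)
    library(gsubfn)

    # Each helper returns a fragment of SQL text, not a value.

    # left-pad with '0' and keep the last two characters
    zero2d <- function(x) sprintf("substr('0' || %s, -2)", x)

    # drop any '/' picked up by a two-character substr
    rmSlash <- function(x) sprintf("replace(%s, '/', '')", x)

    # the year is the last four characters
    Year <- function(x) sprintf("substr(%s, -4)", x)

    # the month is the 1 or 2 characters after the first '/'
    Month <- function(x) {
       y <- sprintf("substr(%s, instr(%s, '/') + 1, 2)", x, x)
       zero2d(rmSlash(y))
    }

    # the day is the first 1 or 2 characters
    Day <- function(x) {
       y <- sprintf("substr(%s, 1, 2)", x)
       zero2d(rmSlash(y))
    }

    # normalize the R-side bounds to yyyy-mm-dd
    fmtDate <- function(x) format(as.Date(x))

    # fn$ evaluates the backquoted R expressions and splices the results into the SQL
    sql <- "select * from df2 where
      `Year('Date')` || '-' ||
      `Month('Date')` || '-' ||
      `Day('Date')`
      between '`fmtDate(first_date)`' and '`fmtDate(second_date)`'"
    fn$sqldf(sql)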
giving:
           Date
    1 1/11/2001
    2 1/12/2002
    3 1/11/2003
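As a cross-check on that result, the same filter can be sketched in base R once the data is in memory. This is only a sanity check under the assumption that df2 is small enough to load; it is not a substitute for filtering while reading. The names parsed and check are made up for this illustration:

```r
# Rebuild the test data from the question.
df2 <- data.frame(Date = paste(rep(1:3, each = 4), 11:12, 2001:2012, sep = "/"),
                  stringsAsFactors = FALSE)

first_date <- as.Date("2001-11-1")
second_date <- as.Date("2003-11-1")

# Parse the d/m/yyyy strings into Date objects and keep rows inside the
# inclusive range, mirroring SQL BETWEEN.
parsed <- as.Date(df2$Date, format = "%d/%m/%Y")
check <- df2[parsed >= first_date & parsed <= second_date, , drop = FALSE]
print(check)
```

The rows kept here should match the ones returned by the SQLite query.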
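Notes
1) Inputs used: the code above uses first_date, second_date and df2 as defined in the question.
2) SQL: this is the SQL statement that is actually executed after substitution: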
    > cat( fn$identity(sql), "\n")
    select * from df2 where
      substr(Date, -4) || '-' ||
      substr('0' || replace(substr(Date, instr(Date, '/') + 1, 2), '/', ''), -2) || '-' ||
      substr('0' || replace(substr(Date, 1, 2), '/', ''), -2)
      between '2001-11-01' and '2003-11-01'
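To see what that expression produces for individual rows, here is a small sketch, assuming sqldf with its default SQLite backend. The data frame probe and the column alias iso are made up for this illustration; the SQL fragments are the ones from the generated statement:

```r
library(sqldf)

# Two sample strings: 1-digit day with 2-digit month, and 2-digit day with 1-digit month.
probe <- data.frame(Date = c("1/11/2001", "12/3/2004"), stringsAsFactors = FALSE)

# Rebuild each date as yyyy-mm-dd using the same year, month and day fragments.
out <- sqldf("select Date,
  substr(Date, -4) || '-' ||
  substr('0' || replace(substr(Date, instr(Date, '/') + 1, 2), '/', ''), -2) || '-' ||
  substr('0' || replace(substr(Date, 1, 2), '/', ''), -2) as iso
  from probe")
print(out)
```

Both the day and the month come out zero-padded, which is what makes the text comparison against the yyyy-mm-dd bounds meaningful.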
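3) Source of the complication: the main complication is the non-standard 1-or-2-digit days and months. Had they always been two digits, the query would reduce to this: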
    first_date <- "2001-11-01"
    second_date <- "2003-11-01"

    fn$sqldf("select Date from df2
      where substr(Date, -4) || '-' ||
            substr(Date, 4, 2) || '-' ||
            substr(Date, 1, 2)
      between '`first_date`' and '`second_date`'")
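The reason the padding matters is that BETWEEN here compares text, not dates. A minimal base R sketch of the effect follows. It uses sort(method = "radix"), which orders character vectors as in the C locale, as a stand-in for SQLite's default byte-wise text comparison; that equivalence is an assumption of this illustration, and the variable names are made up:

```r
# Zero-padded yyyy-mm-dd strings: text order matches calendar order.
padded <- c("2001-11-01", "2001-02-01")
sorted_padded <- sort(padded, method = "radix")

# Unpadded month: February lands after November because the
# character '2' is greater than the character '1'.
unpadded <- c("2001-11-01", "2001-2-01")
sorted_unpadded <- sort(unpadded, method = "radix")

print(sorted_padded)
print(sorted_unpadded)
```

With padded strings February sorts before November as expected; without padding the order is wrong, so a range test on the raw text would silently include or exclude the wrong rows.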
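4) H2: here is an H2 solution. H2 does have a datetime type, which simplifies the solution substantially compared with SQLite. We assume the data is in a file named mydata.dat: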
    library(RH2)
    library(sqldf)

    first_date <- "2001-11-01"
    second_date <- "2003-11-01"

    fn$sqldf(c("CREATE TABLE t(DATE TIMESTAMP) AS
      SELECT parsedatetime(DATE, 'd/M/y') as DATE
      FROM CSVREAD('mydata.dat')",
      "SELECT DATE FROM t WHERE DATE between '`first_date`' and '`second_date`'"))
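Note that the first RH2 query in a session will be slow because it loads Java. After that, you can try it out to see whether the performance is adequate.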