Tips and Tricks for Working with Dates and Times in your Data Pipelines
Working with date and timestamp data seems like it should be trivial. But many complications can arise:
- Date formats can vary by country and region (for example, MM/DD/YYYY versus DD/MM/YYYY).
- Months can be referenced by different string formats or by a number (for example, “March”, “MAR”, or “03”).
- Timestamps can arrive as a formal datetime data type or as a string you must convert.
- Times can be expressed in UTC or in a specific time zone.
- Timestamps can be encoded as a count of time units elapsed since the Unix epoch.
And these are only some of the reasons why working with dates and times – especially when generated from different data sources – can be a significant challenge for data engineers trying to build consistent data pipelines.
This blog post gives you some “tips and tricks” for working with dates and times that may display in data throughout a transformation pipeline.
Extracting Dates and Times from Source Data
While many structured database platforms support date and time datatypes, file-based data stored as JSON, CSV, TSV, XML, and others relies on storing dates and times as strings. Ingesting this data, whether in files or when passed onto a streaming platform such as Kafka or Kinesis, requires a data engineer to parse these strings and then turn them into datetime values that can be transformed in downstream processes. String parsing often involves extensive regex work, which is often tedious and error prone.
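To make the parsing problem concrete, here is a minimal sketch (outside SQLake, in plain Python) of the multi-format string parsing a data engineer typically ends up writing by hand. The format list and function name are illustrative, not part of any SQLake API:

```python
from datetime import datetime

# Candidate formats, tried in order; extend this list as new sources appear.
# Note the inherent ambiguity: "03/09/2022" matches both %m/%d/%Y and
# %d/%m/%Y, so list order silently decides the interpretation.
CANDIDATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%m/%d/%Y",
    "%d/%m/%Y",
    "%B %d, %Y %H:%M:%S",   # e.g. "March 9, 2022 08:04:32"
]

def parse_date_string(raw: str) -> datetime:
    """Try each known format until one matches; raise if none do."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date string: {raw!r}")

print(parse_date_string("March 9, 2022 08:04:32"))  # 2022-03-09 08:04:32
```

Every new source format means another entry in the list and another chance for an ambiguous or wrong parse, which is exactly the maintenance burden described above.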
As data engineers incorporate additional data sources into their pipelines, the specific date formats used in each dataset must be validated and normalized to a standard type and format, and the date parsing processes will likely need to be modified each time.
When you ingest data into Upsolver SQLake, you can use the EXTRACT_TIMESTAMP(<datestring>) function, which takes date strings of various formats and converts them to a consistent timestamp data type in the format ‘yyyy-MM-dd HH:mm:ss’. While parsing the string, it also infers the time zone and converts the timestamp to UTC time.
Below are some examples of strings, along with the EXTRACT_TIMESTAMP output when executed on that string. For reference, the third example (1662542705000) is the number of milliseconds since the Unix epoch.

| Input string | EXTRACT_TIMESTAMP output |
| --- | --- |
| “March 9, 2022 08:04:32” | 2022-03-09 08:04:32.000 |
| “March 9, 2022 08:04:32 US/Eastern” | 2022-03-09 13:04:32.000 |
| 1662542705000 | 2022-09-07 09:25:05.000 |
| 2022-08-19 20:46:00 | 2022-08-19 20:46:00.000 |
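SQLake's actual parsing logic is internal to the product, but the behavior in the table can be sketched in plain Python to show what a hand-rolled equivalent involves. The function name and the assumption that an all-digit input means epoch milliseconds are illustrative:

```python
from datetime import datetime, timezone

def extract_timestamp(value: str) -> datetime:
    """Rough analogue of EXTRACT_TIMESTAMP for two input shapes:
    epoch milliseconds, or a 'YYYY-MM-DD HH:MM:SS' string (assumed UTC).
    This is an illustration, not SQLake's implementation."""
    if value.isdigit():
        # Treat an all-digit input as milliseconds since the Unix epoch.
        return datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc)
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)

print(extract_timestamp("1662542705000"))  # 2022-09-07 09:25:05+00:00
```

A real equivalent would also need the time-zone inference and multi-format handling shown in the table, which is what makes a built-in function attractive.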
Working with Dates When Aggregating Data over Time
Many transformations involving aggregations must process data spanning multiple time intervals. Common aggregation use cases include, but are not limited to:
- Aggregating sales performance by month and quarter.
- Calculating IoT device uptime per minute, hour, and day.
- Summing advertising impression counts on an hourly basis.
With streaming data often including timestamps with millisecond (or shorter) granularity, a data engineer often must transform or truncate a source timestamp to a minimum time period that can be aggregated on. For this, Upsolver SQLake provides a DATE_TRUNC() function, which accepts two parameters:
- The granularity that you want to truncate to (‘day’, ‘hour’, ‘minute’, and so on).
- The timestamp field you wish to truncate.
As an example, when each of the timestamps shown below is passed into DATE_TRUNC(<datepart>, <timestamp>) with the datepart in the second column, it converts to the timestamp shown in the third column. In turn this enables simple aggregation via a GROUP BY clause in a SQL statement.
| Input timestamp | datepart | Output timestamp |
| --- | --- | --- |
| 2022-03-09 08:04:32.000 | ‘hour’ | 2022-03-09 08:00:00.000 |
| 2022-03-09 13:04:32.000 | ‘minute’ | 2022-03-09 13:04:00.000 |
| 2022-09-07 09:25:05.000 | ‘month’ | 2022-09-01 00:00:00.000 |
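The truncation itself is simple enough to sketch outside SQLake. The following Python analogue reproduces the table above; the function name and supported granularities are illustrative, not SQLake's implementation:

```python
from datetime import datetime

def date_trunc(datepart: str, ts: datetime) -> datetime:
    """Truncate ts to the given granularity, in the spirit of DATE_TRUNC
    (a simplified sketch supporting only a few dateparts)."""
    if datepart == "minute":
        return ts.replace(second=0, microsecond=0)
    if datepart == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if datepart == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if datepart == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"Unsupported datepart: {datepart}")

print(date_trunc("hour", datetime(2022, 3, 9, 8, 4, 32)))  # 2022-03-09 08:00:00
```

Grouping rows on the truncated value then yields one bucket per hour, day, or month, which is exactly what the GROUP BY aggregation needs.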
If you do not wish to truncate the date, you can instead extract a single date part out of a timestamp using a variety of functions, including MONTH(), DAY(), HOUR(), and DAY_OF_YEAR(). The below example shows what these function calls return on the same timestamp as in the prior examples (2022-03-09 08:04:32.000):

| Function call | Return value |
| --- | --- |
| MONTH() | 3 |
| DAY() | 9 |
| HOUR() | 8 |
| DAY_OF_YEAR() | 68 |
Extracting the date part can be useful if, for example, you want to aggregate all data for the month of March across all years. DATE_TRUNC('month', <timestamp>) also enables a monthly aggregate, although March 2022 and March 2021 would be aggregated separately.
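The difference between grouping on an extracted month and grouping on a truncated month can be sketched in plain Python. The sales figures here are made-up illustration data, not from the original post:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (timestamp, amount) sales records spanning two years.
sales = [
    (datetime(2021, 3, 5), 100),
    (datetime(2022, 3, 9), 250),
    (datetime(2022, 4, 1), 80),
]

# Grouping by the month *part* merges March 2021 and March 2022...
by_month_part = defaultdict(int)
for ts, amount in sales:
    by_month_part[ts.month] += amount

# ...whereas truncating to the month keeps each year separate.
by_month_trunc = defaultdict(int)
for ts, amount in sales:
    by_month_trunc[ts.replace(day=1)] += amount

print(by_month_part[3])                      # 350: both Marches combined
print(by_month_trunc[datetime(2022, 3, 1)])  # 250: March 2022 alone
```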
Converting Timestamps to Different Time Formats and Time Zones
Business users often request that time data be presented in a specific format, or be converted into a certain time zone. For example, a regional sales leader who works in Europe may want to see dates in a DD/MM/YYYY format in a CET time zone, whereas a sales leader in the United States may want to see dates in a MM/DD/YYYY format in Eastern Time. As we’ve discussed, the EXTRACT_TIMESTAMP() function makes it simple to convert a source string into a proper timestamp data type. And once a proper timestamp has been established as a standard, it can be reconverted into many different formats and time zones.
Upsolver SQLake provides a FORMAT_DATETIME() function that makes it possible to convert a timestamp into any format as defined by the <format string> parameter.
Below you can see an example of how a <format string> changes the way a timestamp displays.

Timestamp: 2022-03-09 15:04:32.29072000

Formatted output: Wed, March 09, 2022 15:04:32.29072000
For more detailed assistance in building format strings, please see the Oracle Java documentation for the DateTimeFormatter class.
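For comparison, the same kind of reformatting can be sketched in Python with strftime. Note that strftime directives (%d, %B, and so on) are a different pattern language from the Java DateTimeFormatter patterns that FORMAT_DATETIME uses:

```python
from datetime import datetime

ts = datetime(2022, 3, 9, 15, 4, 32)

# A few illustrative output formats for the same timestamp:
print(ts.strftime("%d/%m/%Y %H:%M:%S"))       # 09/03/2022 15:04:32
print(ts.strftime("%m/%d/%Y %I:%M:%S%p"))     # 03/09/2022 03:04:32PM
print(ts.strftime("%a, %B %d, %Y %H:%M:%S"))  # Wed, March 09, 2022 15:04:32
```

Whichever pattern language you use, keeping a single canonical timestamp and formatting only at presentation time avoids re-parsing strings downstream.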
To convert timestamps between different time zones, you can use the ADD_TIME_ZONE_OFFSET(<timestamp>, <time zone>) function, where <time zone> can be in a variety of formats following the Java ZoneId convention (for example, a region name such as ‘America/New_York’ or a fixed offset such as ‘-5’). Each call converts a source timestamp (assumed to be UTC) to local time in the given time zone.

Timestamp: 2022-03-09 15:04:32.29072000
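The equivalent conversion can be sketched in Python with the standard-library zoneinfo module, which accepts the same IANA region names that the Java ZoneId convention uses. This is an analogue for illustration, not SQLake's function:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# Source timestamp, assumed to be UTC (as with SQLake's canonical form):
ts = datetime(2022, 3, 9, 15, 4, 32, tzinfo=timezone.utc)

# On 2022-03-09, Israel is UTC+2 and New York is UTC-5 (neither is yet on DST).
print(ts.astimezone(ZoneInfo("Israel")))            # 2022-03-09 17:04:32+02:00
print(ts.astimezone(ZoneInfo("America/New_York")))  # 2022-03-09 10:04:32-05:00
```

Using region names rather than fixed offsets lets the library apply daylight-saving rules automatically, which a hard-coded ‘-5’ offset would not.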
Timestamp Processing with SQLake Declarative Pipelines
We’ve reviewed several methods native to SQLake that you can use to extract, convert, and format timestamp values within data pipelines. While we’ve explained each function call separately, they are often chained together to perform a transformation in a single expression. In the code below, a string passed into a SQLake transformation job is converted to a timestamp via EXTRACT_TIMESTAMP(), shifted to the Israel time zone using ADD_TIME_ZONE_OFFSET(), and then formatted into a readable value:

```sql
FORMAT_DATETIME(ADD_TIME_ZONE_OFFSET(EXTRACT_TIMESTAMP(date_val), 'Israel'), 'dd/MM/YYYY HH:mm:ss')
```
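The same parse-shift-format chain can be sketched in Python as one small function. The function name and the fixed input format are illustrative assumptions:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def to_local_display(date_val: str, zone: str, fmt: str) -> str:
    """Parse -> shift time zone -> format: a rough analogue of the chained
    FORMAT_DATETIME(ADD_TIME_ZONE_OFFSET(EXTRACT_TIMESTAMP(...))) call."""
    ts = datetime.strptime(date_val, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return ts.astimezone(ZoneInfo(zone)).strftime(fmt)

print(to_local_display("2022-03-09 15:04:32", "Israel", "%d/%m/%Y %H:%M:%S"))
# 09/03/2022 17:04:32
```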
If you wrap these transformation functions into a SQLake job, you can create as many variations as needed in your output tables. Below is the transformation job from the “Working with dates and times” SQLake template (details for accessing and using this template yourself are below).
```sql
CREATE JOB transform_dates
    START_FROM = BEGINNING
    ADD_MISSING_COLUMNS = TRUE
    RUN_INTERVAL = 1 MINUTE
AS INSERT INTO default_glue_catalog.database_27eaf9.datetime_formatted MAP_COLUMNS_BY_NAME
SELECT
    date_val as date_string,
    EXTRACT_TIMESTAMP(date_val) as ex_ts,
    FORMAT_DATETIME(EXTRACT_TIMESTAMP(date_val), 'dd/MM/YYYY HH:mm:ss') as date_formatted,
    FORMAT_DATETIME(ADD_TIME_ZONE_OFFSET(EXTRACT_TIMESTAMP(date_val), 'Israel'), 'dd/MM/YYYY HH:mm:ss') as date_formatted_isr,
    FORMAT_DATETIME(ADD_TIME_ZONE_OFFSET(EXTRACT_TIMESTAMP(date_val), 'America/New_York'), 'MM/dd/YYYY hh:mm:ssa') as date_formatted_ny,
    FORMAT_DATETIME(ADD_TIME_ZONE_OFFSET(EXTRACT_TIMESTAMP(date_val), 'America/New_York'), 'E, MMMM dd, YYYY HH:mm:ss.A') as date_formatted_long,
    ADD_TIME_ZONE_OFFSET(EXTRACT_TIMESTAMP(date_val), 'America/New_York') as date_ny,
    ADD_TIME_ZONE_OFFSET(EXTRACT_TIMESTAMP(date_val), '-5') as date_ny_minus5,
    MONTH(EXTRACT_TIMESTAMP(date_val)) as date_month,
    DATE_TRUNC('month', EXTRACT_TIMESTAMP(date_val)) as date_trunc_month,
    DAY(EXTRACT_TIMESTAMP(date_val)) as date_day,
    DATE_TRUNC('day', EXTRACT_TIMESTAMP(date_val)) as date_trunc_day,
    DAY_OF_YEAR(EXTRACT_TIMESTAMP(date_val)) as date_day_of_year,
    DATE_TRUNC('hour', EXTRACT_TIMESTAMP(date_val)) as date_trunc_hour,
    HOUR(EXTRACT_TIMESTAMP(date_val)) as date_hour
FROM default_glue_catalog.database_27eaf9.datetime_raw
WHERE $commit_time BETWEEN RUN_START_TIME() AND RUN_END_TIME();
```
Try Using Upsolver SQLake for Free (Early Access) to Simplify Your Timestamp Processing
As you’ve seen, Upsolver SQLake makes it easy to work with timestamps, even when the source data does not formally support a datetime data type. Transformations that may have previously required complex string parsing and many lines of code can now be executed with a single SQL statement and a few function calls.
To try this for yourself, access the “Working with dates and timestamps” template in SQLake.
SQLake is Upsolver’s newest offering. It lets you build and run reliable data pipelines on streaming and batch data via an all-SQL experience. You can try SQLake for free for 30 days; no credit card required.