"DFLib" (short for "DataFrame Library") is a lightweight, pure Java implementation of DataFrame. DataFrame is a very common structure in data science and Big Data worlds. It provides operations like search, filtering, joins, aggregations, statistical functions, etc., that are reminiscent of SQL (also of Excel), except you run them in your app over dynamic in-memory data sets.

There are DataFrame implementations in Python (pandas), R, Apache Spark, etc. The DFLib project’s goal is to provide the same functionality for regular Java applications. It is a simple library that requires no special infrastructure. DFLib core is dependency-free.

The code in this documentation can be run in any Java IDE, as well as (and this is pretty cool!) in a Jupyter notebook. You will need Java 8 or newer. Running in Jupyter will require Java 9+.

Getting Started with DFLib

Include DFLib in a project. Assuming you are using Maven, start by declaring a "BOM" to have a common version for multiple DFLib modules:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.nhl.dflib</groupId>
      <artifactId>dflib-bom</artifactId>
      <version>0.7</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

Next include DFLib core as a dependency:

<dependency>
    <groupId>com.nhl.dflib</groupId>
    <artifactId>dflib</artifactId>
</dependency>

Create a DataFrame, do some manipulations with it, print the result:

DataFrame df1 = DataFrame
        .newFrame("a", "b", "c")
        .foldIntStreamByRow(IntStream.range(1, 10000));

DataFrame df2 = df1.selectRows(
        df1.getColumnAsInt(0).indexInt(i -> i % 2 == 0));

System.out.println(Printers.tabular.toString(df2));

When you run this code, console output will look like this:

   a    b    c
---- ---- ----
   4    5    6
  10   11   12
  16   17   18
...
9982 9983 9984
9988 9989 9990
9994 9995 9996
1666 rows x 3 columns
We’ll omit print statements in all the following examples and will simply display their output. Details of printing are discussed in the "Printers" chapter.

Main Data Structures

The two essential DFLib classes are Series and DataFrame. Series is a 1-dimensional array-like object, and DataFrame is a 2-dimensional table-like object. Columns in a DataFrame are stored as Series. Additionally, we will discuss the Index object that stores DataFrame column names.

Both DataFrame and Series (and Index) are fully immutable, so all operations on them return a new instance. Behind the scenes the library shares as much data as possible between instances, so copying these objects does not cause significant performance degradation, and in turn makes DFLib fully thread-safe, allowing multiple concurrent operations on the same data structure. Immutability also means that you can have a full snapshot of your data at each step of a transformation, simplifying debugging and auditing of data pipelines.
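As a quick illustration (using the Series API covered in the next section), note how an operation produces a new object while leaving the original untouched:

```java
Series<String> s = Series.forData("a", "b");

// concat(..) returns a new Series; "s" itself is not modified
Series<String> longer = s.concat(Series.forData("c"));

System.out.println(s.size());      // still 2
System.out.println(longer.size()); // 3
```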

Series

Series is the simplest of the two data structures. You can think of it as a wrapper around an array of values. You can use Series to model sequences of data such as timestamps in a time series. The Series object is parameterized for the type of data that it holds. So there can be e.g. a Series<String> or a Series<LocalDate>. There is also an important category of "primitive" Series (IntSeries, LongSeries, DoubleSeries, BooleanSeries) that are optimized for memory use and arithmetic operations. Though somewhat less efficiently, they can also be used as Series of the corresponding wrapper type (e.g. IntSeries is also a Series<Integer>).
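For example, the same IntSeries can be read either through a primitive accessor or through the generic Series<Integer> API (this sketch assumes the getInt accessor on IntSeries):

```java
IntSeries ints = IntSeries.forInts(1, 2, 3);

// primitive access, no boxing
int first = ints.getInt(0);

// the same object viewed as a Series of the wrapper type;
// each value accessed this way is boxed
Series<Integer> boxed = ints;
Integer firstBoxed = boxed.get(0);
```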

Series object is important as a building block of DataFrame, but it also defines a number of useful data manipulation and transformation operations on its own. Those will be covered in the following chapters. Here we’ll discuss how to create Series.

Creating Series from Arrays

A Series can be created using the static forData method on the Series interface:

Series<String> s = Series.forData("a", "bcd", "ef", "g");
a
bcd
...
g
4 elements

Primitive Series classes have their own factory methods. E.g.:

IntSeries is = IntSeries.forInts(0, 1, -300, Integer.MAX_VALUE);

Creating Series Incrementally

If we don’t have a full array of data to start with, and instead have code producing a sequence of values of unpredictable length, we can use special mutable "accumulators" to store the sequence and then produce a Series out of it. E.g. the following example reads data from an InputStream line by line, as Strings:

// InputStream inputStream = ...
Accumulator<String> accum = new ObjectAccumulator<>();
Scanner scanner = new Scanner(inputStream);
while (scanner.hasNextLine()) {
    accum.add(scanner.nextLine());
}

Series<String> s = accum.toSeries();

Similarly there are IntAccumulator, LongAccumulator, etc.
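E.g., a primitive accumulator can collect ints without boxing. The following is only a sketch that assumes IntAccumulator mirrors ObjectAccumulator's add(..)/toSeries() API; check the exact method signatures in your DFLib version:

```java
IntAccumulator accum = new IntAccumulator();
for (int i = 0; i < 100; i++) {
    if (i % 7 == 0) {
        accum.add(i); // assumed to store an unboxed int
    }
}

IntSeries s = accum.toSeries(); // assumed to produce a primitive IntSeries
```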

DataFrame

DataFrame is an in-memory table with rows and columns. Each column has a name, and is represented as a Series object. DataFrame can contain columns of different kinds and is not parameterized for a single type.

There are a few ways to create a DataFrame. Here we’ll show the "builder" API that converts arrays, Streams, Collections, Series and other in-memory data into DataFrames.

Alternatively a DataFrame can be loaded from (and saved to) a database or a CSV file. Those data sources are discussed in separate documentation chapters.

First example - adding data row by row:

DataFrame df = DataFrame
        .newFrame("name", "age") (1)
        .addRow("Joe", 18)   (2)
        .addRow("Andrus", 45)
        .addRow("Joan", 32)
        .create();
1 Builder is created from column names
2 Passing individual rows one by one

The resulting DataFrame looks like this:

name   age
------ ---
Joe    18
Andrus 45
Joan   32
3 rows x 2 columns

Another example - if we have a single array and need to "fold" it into a two-dimensional DataFrame row-by-row:

DataFrame df = DataFrame
        .newFrame("name", "age") (1)
        .foldByRow("Joe", 18, "Andrus", 45, "Joan", 32); (2)
1 Builder is created from column names
2 Passing a varargs array of values that is folded to match the specified number of columns, row by row.

Same, but folding column-by-column:

DataFrame df = DataFrame
        .newFrame("name", "age")
        .foldByColumn("Joe", "Andrus", "Joan", 18, 45, 32);

You can also create DataFrames from Collections and Streams (folded either by row or by column). Here is an example of how to use Streams of primitive ints, creating a DataFrame made of memory-efficient IntSeries columns:

DataFrame df = DataFrame
        .newFrame("col1", "col2")
        .foldIntStreamByColumn(IntStream.range(0, 10000));
col1 col2
---- ----
   0 5000
   1 5001
   2 5002
...
4997 9997
4998 9998
4999 9999
5000 rows x 2 columns

Finally, a DataFrame can be created from an array of Series, each Series representing a column:

DataFrame df = DataFrame
        .newFrame("name", "age")
        .columns(
                Series.forData("Joe", "Andrus", "Joan"),
                IntSeries.forInts(18, 45, 32)
        );

This is the fastest and most efficient way to make a DataFrame, as an array of Series is how it is structured internally.

Now that we’ve discussed how to create Series and DataFrames, we’ll talk about printing their contents, and then will continue to the chapters that describe data operations.

Index

Index is somewhat similar to Series. It stores a sequence of String labels corresponding to the column labels of the DataFrame it is attached to. While you normally won’t need to create or manipulate an Index outside of a DataFrame, you may use it to access DataFrame column labels.

DataFrame df = DataFrame
        .newFrame("col1", "col2")
        .foldIntStreamByColumn(IntStream.range(0, 10000));

String[] labels = df.getColumnsIndex().getLabels(); (1)
1 Access DataFrame index and read label names.

Printers

When doing data exploration and running data pipelines, it is important to be able to visualize the data at every step. More advanced forms of visualization include charts and diagrams, but the simplest thing you can do is print the data to the console. Both DataFrame and Series implement a toString() method that prints their contents as a single line, truncating large data sets in the process, which is good for debugging applications running on a server.

A more human-friendly form of output is produced by a tabular printer. Here is how to use the default tabular printer:

DataFrame df = DataFrame
        .newFrame("col1", "col2", "col3")
        .foldIntStreamByColumn(IntStream.range(0, 10000));

String table = Printers.tabular.toString(df);
System.out.println(table);

It prints the data as follows, displaying at most 6 rows and truncating the rest. The same goes for cell values: values longer than 30 characters are truncated as well (since the values in our example are short, value truncation is not visible here).

col1 col2 col3
---- ---- ----
   0 3334 6668
   1 3335 6669
   2 3336 6670
...
3331 6665 9999
3332 6666    0
3333 6667    0
3334 rows x 3 columns

You can change the truncation parameters by creating your own printer:

Printer printer = new TabularPrinter(3, 3); (1)
String table = printer.toString(df);
System.out.println(table);
1 Create a printer that displays at most 3 rows and up to 3 characters in each cell.
c.. c.. c..
--- --- ---
  0 3.. 6..
  1 3.. 6..
...
3.. 6..   0
3334 rows x 3 columns
If you are using Jupyter Notebook, all the printers are already set up for you. So if the last statement in a Jupyter cell is a DataFrame or a Series, it will be printed as a table in the notebook.

Selecting Rows and Columns

Before analyzing or transforming a Series or a DataFrame, very often they need to be narrowed down to just the relevant data. The API for selecting and filtering rows and columns described in this chapter helps with such tasks.

Selecting Columns

The selectColumns(..) method allows you to do a few things at once: select a subset of columns in the DataFrame, reorder columns, and even duplicate columns.

  • Selecting by column name. In the example below we are getting rid of the "middle" column and reordering "first" and "last" columns:

DataFrame df = DataFrame.newFrame("first", "last", "middle").foldByRow(
        "Jerry", "Cosin", "M",
        "Amanda", "Gabrielly", null,
        "Joan", "O'Hara", "J");

DataFrame df1 = df.selectColumns("last", "first");
last      first
--------- ------
Cosin     Jerry
Gabrielly Amanda
O'Hara    Joan
3 rows x 2 columns
  • Selecting by column position instead of the name:

DataFrame df = DataFrame.newFrame("first", "last", "middle").foldByRow(
        "Jerry", "Cosin", "M",
        "Amanda", "Gabrielly", null,
        "Joan", "O'Hara", "J");

DataFrame df1 = df.selectColumns(1, 0);
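As mentioned above, selectColumns(..) can also duplicate a column by repeating its name or position. Since column names in a DataFrame must be unique, expect the repeated column to be renamed in the result (DFLib appends a suffix to conflicting names; the exact renaming may vary by version). A sketch:

```java
DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Joan", "O'Hara");

// position 1 ("last") is referenced twice, duplicating the column;
// the duplicate gets an auto-generated unique name
DataFrame df1 = df.selectColumns(1, 1, 0);
```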

Deleting Columns

DataFrame.dropColumns(..) is an operation complementary to selectColumns. It allows removing one or more columns, but does not allow reordering the remaining columns.

DataFrame df = DataFrame.newFrame("first", "last", "middle").foldByRow(
        "Jerry", "Cosin", "M",
        "Amanda", "Gabrielly", null,
        "Joan", "O'Hara", "J");

DataFrame df1 = df.dropColumns("first", "middle"); (1)
1 Drop one or more columns from the DataFrame.
last
---------
Cosin
Gabrielly
O'Hara
3 rows x 1 column

Selecting Rows

  • Selecting a subset of rows by position. As there are no row labels in DFLib yet, selecting by position is the only "direct" selection option. Just like column selection, it allows duplicating rows by referencing the same position more than once, as well as reordering rows:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Amanda", "Gabrielly",
        "Joan", "O'Hara");

DataFrame df1 = df.selectRows(2, 0, 0);
first last
----- ------
Joan  O'Hara
Jerry Cosin
Jerry Cosin
3 rows x 2 columns
  • Selecting using an index. You may precalculate an "index" (an IntSeries) of row positions and then use it for row selection. An index often comes from applying a condition to a Series that is not a part of the DataFrame. E.g.:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Amanda", "Gabrielly",
        "Joan", "O'Hara");

IntSeries rowNumbers = new IntSequenceSeries(0, df.height());

IntSeries selectionIndex = rowNumbers.indexInt(i -> i % 3 == 0); (1)

DataFrame df1 = df.selectRows(selectionIndex);
1 The index is calculated by applying a condition to a Series that is not a part of the DataFrame whose rows we are selecting.
first last
----- -----
Jerry Cosin
1 row x 2 columns

Filtering Rows

A very common form of row selection is "filtering", i.e. building a DataFrame with a subset of rows that match a conditional expression. Unlike selectRows, the filterRows(..) methods do not require direct references to row positions. Filtering also does not allow reordering or duplicating rows. The following flavors of filtering are available:

  • filterRows(..) with column condition:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Amanda", "Gabrielly",
        "Joan", "O'Hara");

DataFrame df1 = df.filterRows(
        "first",
        (String f) -> f != null && f.startsWith("J"));
first last
----- ------
Jerry Cosin
Joan  O'Hara
2 rows x 2 columns
  • filterRows(..) with full row condition:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Amanda", "Gabrielly",
        "Joan", "O'Hara");

DataFrame df1 = df.filterRows(r ->
        r.get("first").toString().startsWith("J")
                && r.get("last").toString().startsWith("O"));
first last
----- ------
Joan  O'Hara
1 row x 2 columns
  • filterRows(..) with a BooleanSeries "index". We’ve already seen how to select rows using an IntSeries index, which allowed the index to be calculated outside the DataFrame whose rows we are filtering. A BooleanSeries "index" (or "mask") can similarly serve as a filter condition. Such an index is usually the result of a prior logical operation between Series. E.g.:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Amanda", "Gabrielly",
        "Joan", "O'Hara");

Series<String> names = Series.forData("Sandra", "Anton", "Joan");
BooleanSeries mask = names.eq(df.getColumn("first")); (1)

DataFrame df1 = df.filterRows(mask);
1 The index is calculated by comparing a Series of names with the DataFrame column of first names.
first last
----- ------
Joan  O'Hara
1 row x 2 columns

Head and Tail

When you only need the "first N" or "last M" rows of a DataFrame (or values of a Series) you can use head and tail operations. Here is how to get the top 2 rows of a DataFrame:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Amanda", "Gabrielly",
        "Joan", "O'Hara");

DataFrame df1 = df.head(2); (1)
1 Returns a new DataFrame with two top rows of the original DataFrame.
first  last
------ ---------
Jerry  Cosin
Amanda Gabrielly
2 rows x 2 columns

tail works similarly, but returns the last N rows:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Amanda", "Gabrielly",
        "Joan", "O'Hara");

DataFrame df1 = df.tail(1);
first last
----- ------
Joan  O'Hara
1 row x 2 columns

Series also define head(..) and tail(..) operations that do what you’d expect from them, returning first or last N values.
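For example:

```java
Series<String> s = Series.forData("a", "b", "c", "d");

Series<String> first2 = s.head(2); // "a", "b"
Series<String> last1 = s.tail(1);  // "d"
```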

Unlike index operations on arrays and lists in Java, head(..) and tail(..) are safe with regard to bounds checking. They DO NOT throw exceptions when the length parameter is bigger than the DataFrame height (or Series size), returning an empty DataFrame (or Series) instead.

Columns Manipulation

Selecting rows and columns, described in the previous chapter, helps to produce a desired subset of data. Here we’ll demonstrate a few extra methods for manipulating DataFrame columns:

  • Adding columns with new data that may or may not be derived from the data in other DataFrame columns.

  • Renaming columns

  • Normalizing column data and converting it to a specific Java type

Adding Columns

There are a few ways to add columns to a DataFrame:

  • Add a column by calculating cell values, based on data from other columns:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Amanda", "Gabrielly",
        "Joan", "O'Hara");

DataFrame df1 = df.addColumn(
        "full",                                   (1)
        r -> r.get("first") + " " + r.get("last") (2)
);
1 New column name
2 Operation to produce values for the new column from the existing row. r is a special RowProxy object, providing access to the row being processed.
first   last      full
------ --------- ----------------
Jerry  Cosin     Jerry Cosin
Amanda Gabrielly Amanda Gabrielly
Joan   O'Hara    Joan O'Hara
3 rows x 3 columns
  • Add multiple columns in a single method call (generally faster than adding them individually):

DataFrame df1 = df.addColumns(
        new String[]{"last_initial", "first_initial"},
        r -> r.get("last").toString().charAt(0),
        r -> r.get("first").toString().charAt(0)
);
first   last      last_initial first_initial
------ --------- ------------ -------------
Jerry  Cosin     C            J
Amanda Gabrielly G            A
Joan   O'Hara    O            J
3 rows x 4 columns
  • Adding a Series. If you already have a Series object of the size matching the DataFrame height, it can be added as a column. E.g. we can add a column that will number DataFrame rows:

IntSeries rowNumbers = new IntSequenceSeries(0, df.height()); (1)
DataFrame df1 = df.addColumn(
        "number",
        rowNumbers
);
1 A Series that produces a sequence of numbers from 0 to df.height().
first   last      number
------ --------- ------
Jerry  Cosin          0
Amanda Gabrielly      1
Joan   O'Hara         2
3 rows x 3 columns

For this particular task (adding row numbers) there’s a shortcut that does the same thing as the code above:

DataFrame df1 = df.addRowNumber("number");

Renaming Columns

Often DataFrames come from external sources (like CSV files or database tables) with column names that require further cleanup to match the expectations of the application. There are a number of ways to rename columns:

  • Rename individual columns:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Joan", "O'Hara");

DataFrame df1 = df
        .renameColumn("first", "first_name")
        .renameColumn("last", "last_name");
first_name last_name
---------- ---------
Jerry      Cosin
Joan       O'Hara
2 rows x 2 columns
  • Specify the new names of all columns at once. This may not be practical for a large number of columns:

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Joan", "O'Hara");

DataFrame df1 = df.renameColumns("first_name", "last_name");
first_name last_name
---------- ---------
Jerry      Cosin
Joan       O'Hara
2 rows x 2 columns
  • Rename using a map of old names to new names. Names not present in the map will remain unchanged:

Map<String, String> nameMap = new HashMap<>();
nameMap.put("first", "first_name");

DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
        "Jerry", "Cosin",
        "Joan", "O'Hara");

DataFrame df1 = df.renameColumns(nameMap);
first_name last
---------- ------
Jerry      Cosin
Joan       O'Hara
2 rows x 2 columns
  • Rename using a function, applied to all column names in turn. This is useful e.g. to convert all names to lower case:

DataFrame df = DataFrame.newFrame("FIRST", "LAST").foldByRow(
        "Jerry", "Cosin",
        "Joan", "O'Hara");

DataFrame df1 = df.renameColumns(String::toLowerCase);
first last
----- ------
Jerry Cosin
Joan  O'Hara
2 rows x 2 columns

Changing Column Type

Data may come to us in various forms. Often it comes as Strings or binary objects, leaving it up to the application to interpret them. There are a few methods in DataFrame to convert column values from one type to another. The most generic is convertColumn, which can convert between any two data types:

DataFrame df = DataFrame.newFrame("year", "sales").foldByRow(
        "2017", "2005365.01",
        "2018", "4355098.75");

DataFrame df1 = df
        .convertColumn(0, (String s) -> Integer.valueOf(s))
        .convertColumn(1, (String s) -> new BigDecimal(s));

Additionally there are methods that can convert Strings and Numbers to primitive values:

DataFrame df1 = df
        .toIntColumn("year", 0) (1)
        .toDoubleColumn("sales", 0.); (1)
1 The second method argument is the primitive value to use if the value being converted is null

toIntColumn and other such methods are not just syntactic sugar. They convert DataFrame columns to primitive Series, which take significantly less memory and compute most operations faster than Object-based ones. You should always consider using them where possible.

In addition to the methods for numeric primitives, DataFrame has toBooleanColumn, toEnumFromNumColumn and toEnumFromStringColumn methods that serve a similar purpose, only for booleans and enums.
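E.g., a String column can be turned into an enum column. This is a sketch that assumes toEnumFromStringColumn takes the column name and the enum class, with String values matching enum constant names; verify the signature in your DFLib version:

```java
enum Color { red, green, blue }

DataFrame df = DataFrame.newFrame("color").foldByRow("red", "green", "blue");

// each String is converted to the matching Color constant
DataFrame df1 = df.toEnumFromStringColumn("color", Color.class);
```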

Cleaning up Data

Data coming into DFLib may be of varying quality and hence may require cleanup before processing. We’ve already discussed the column type conversion operation, which among other things allows normalizing data. Another typical case of data requiring cleanup is missing values. Such values are represented as regular Java nulls. There is a variety of reasons why data can be missing. A rather common one is the outer join and concatenation operations described further down in this document.

In all of these cases you may have to guess or extrapolate the actual values based on some internal knowledge of the dataset. There are methods in both Series and DataFrame that allow replacing nulls with either fixed values or values from adjacent cells. First, let’s see how this works for Series…​

Filling Nulls in Series

The simplest approach is to replace nulls with a constant value:

Series<String> withNulls = Series.forData("a", "b", null, null, "c");
Series<String> filled = withNulls.fillNulls("X");
a
b
X
X
c
5 elements

Another strategy is to use values adjacent to the cells with nulls:

Series<String> filled = withNulls.fillNullsForward(); (1)
1 fillNullsForward takes a non-null value preceding the null cell and uses it to replace nulls. Similarly there’s a fillNullsBackward method that uses the value following the null cell.
a
b
b
b
c
5 elements

Finally, there’s a way to fill nulls using another Series object as a kind of "template" or "mask":

Series<String> mask = Series.forData("A", "B", "C", "D", "E", "F", "G");
Series<String> filled = withNulls.fillNullsFromSeries(mask);
a
b
C
D
c
5 elements

Notice that the "mask" Series can be longer (or shorter) than the Series where the nulls are replaced. Values between the two Series objects are aligned by position starting at zero.

Filling Nulls in DataFrame

Filling nulls in a DataFrame is done by column, so the API is very similar to the Series.fillNulls* methods, only with an additional String or integer argument specifying the column.

DataFrame withNulls = DataFrame.newFrame("c1", "c2").foldByRow(
        "1", "1",
        null, null,
        "2", "2");

DataFrame filled = withNulls
        .fillNullsBackwards("c1")
        .fillNullsForward("c2");
c1 c2
-- --
1  1
2  1
2  2
3 rows x 2 columns

Sorting

You can sort values in Series, and sort rows in DataFrames. As is the case everywhere else in DFLib, sorting does not alter the original object, and instead creates a new instance of either Series or DataFrame.

First let’s look at sorting Series…​

Sorting Series

Series provides a sort method to sort its data using a custom Comparator:

Series<String> s = Series.forData("12", "1", "123")
        .sort(Comparator.comparingInt(String::length));

The result is a new Series with sorted data:

1
12
123
3 elements

Another example shows how to sort Series in the "natural" order (requires Series type to be compatible with java.lang.Comparable):

Series<String> s = Series.forData("c", "d", "a")
        .sort(Comparator.naturalOrder());
a
c
d
3 elements

Series of primitives provide methods to sort values in the natural order without an explicit comparator:

LongSeries s = LongSeries.forLongs(Long.MAX_VALUE, 15L, 0L)
        .sortLong();
                  0
                 15
9223372036854775807
3 elements

Additionally IntSeries has an optimized method to sort ints with a custom IntComparator.

Next we’ll check how to sort a DataFrame.

Sorting Rows by Column

Rows in a DataFrame can be sorted by any single column, assuming that column contains values that implement java.lang.Comparable (e.g. Strings, numbers, etc.):

DataFrame df = DataFrame.newFrame("first", "last", "middle").foldByRow(
        "Jerry", "Cosin", "M",
        "Amanda", "Gabrielly", null,
        "Joan", "O'Hara", "J");

DataFrame df1 = df.sort("first", true); (1)
1 The first argument is the column name (can also be column integer index), the second - a boolean indicating sort direction (true means ascending, false - descending order).
first  last      middle
------ --------- ------
Amanda Gabrielly null
Jerry  Cosin     M
Joan   O'Hara    J
3 rows x 3 columns

Sort column can either be specified by name, as in the example above, or by position:

DataFrame df1 = df.sort(0, true);

Sorting Rows by Multiple Columns

To specify additional sort columns, you can provide an array of column names (or column positions) and an array of sort direction specs:

DataFrame df = DataFrame.newFrame("first", "last", "middle").foldByRow(
        "John", "Cosin", "M",
        "Amanda", "Gabrielly", null,
        "Joan", "Cosin", "J");

DataFrame df1 = df.sort(new String[]{"last", "first"}, new boolean[]{true, false});
first  last      middle
------ --------- ------
John   Cosin     M
Joan   Cosin     J
Amanda Gabrielly null
3 rows x 3 columns

Sorting Rows with Custom Algorithm

If it is not possible to sort by values of a specific column or columns (e.g. when they are not Comparable, or we need to use a non-standard ordering algorithm), you can provide your own function that generates an arbitrary Comparable value for each row in the DataFrame:

DataFrame df = DataFrame.newFrame("first", "last", "middle").foldByRow(
        "Jerry", "Cosin", "M",
        "Amanda", "Gabrielly", null,
        "Joan", "O'Hara", "J");

DataFrame df1 = df.sort(r -> r.get(0).toString().length()); (1)
1 Sorting by the length of first name String.
first  last      middle
------ --------- ------
Joan   O'Hara    J
Jerry  Cosin     M
Amanda Gabrielly null
3 rows x 3 columns

Concatenating

There are various ways to combine data from multiple Series or DataFrames. The simplest is concatenation, described in this chapter. Series can be concatenated together producing a longer Series. DataFrames can be concatenated either vertically (along the rows axis) or horizontally (along the columns axis).

Concatenating Series

If you have a Series object and want to concatenate it with one or more other Series, you’d use Series.concat(..) method:

Series<String> s1 = Series.forData("x", "y", "z");
Series<String> s2 = Series.forData("a");
Series<String> s3 = Series.forData("m", "n");

Series<String> sConcat = s1.concat(s2, s3);

The result, as expected, contains the values of all the Series put together:

x
y
z
a
m
n
6 elements

If you have a collection or an array of Series and want to "glue" them together, you can use the static SeriesConcat.concat(..):

Collection<Series<String>> ss = asList(
        Series.forData("x", "y", "z"),
        Series.forData("a"),
        Series.forData("m", "n"));

Series<String> sConcat = SeriesConcat.concat(ss);

The result is the same as in the previous example. You can also concatenate a Series with itself:

Series<String> s = Series.forData("x", "y");
Series<String> sConcat = s.concat(s);
x
y
x
y
4 elements

Concatenating DataFrames

DataFrame offers two options for concatenation - vertical (stacking DataFrames on top of each other) and horizontal (putting them next to each other). Let’s see some examples:

DataFrame df1 = DataFrame.newFrame("a", "b").foldByRow(
        1, 2,
        3, 4);

DataFrame df2 = DataFrame.newFrame("a", "b").foldByRow(
        5, 6,
        7, 8);

DataFrame dfv = df1.vConcat(df2); (1)
1 Concatenate this and another DataFrame vertically. The argument is a vararg, so more than one DataFrame can be passed to the method.
a b
- -
1 2
3 4
5 6
7 8
4 rows x 2 columns
DataFrame dfh = df1.hConcat(df2); (1)
1 Concatenate this and another DataFrame horizontally. The argument is a vararg, so more than one DataFrame can be passed to the method.
a b a_ b_
- - -- --
1 2 5  6
3 4 7  8
2 rows x 4 columns
Since both df1 and df2 had the same column names, and a DataFrame must have unique column names, a "_" suffix was automatically appended to the conflicting columns in the resulting DataFrame. We will see this auto-renaming behavior in other places, such as joins.

So far all our concatenation examples consisted of DataFrames that had matching dimensions (same number and names of columns for vConcat and same number of rows for hConcat). But what if concatenated DataFrames are shaped or named differently?

Concat methods may take an extra "how" parameter to define concatenation semantics. The type of the "how" parameter is JoinType, and can be one of inner (default), left, right, full.

We’ll see JoinType again soon when discussing joins. Concatenation and joins are related fairly closely.

Let’s look at how this works with vConcat:

DataFrame df1 = DataFrame.newFrame("b", "a").foldByRow(
        1, 2,
        3, 4);

DataFrame df2 = DataFrame.newFrame("a", "c").foldByRow( (1)
        5, 6,
        7, 8);

DataFrame dfv = df1.vConcat(JoinType.inner, df2); (2)
1 df1 column names are "b" and "a", while df2's are "a" and "c".
2 Explicitly passing JoinType.inner to vConcat. It is done to demonstrate the point. Since inner is the default, omitting it will not change the outcome.
a
-
2
4
5
7
4 rows x 1 column

As you can see in the result, the behavior of an inner join is to only keep columns that are present in both df1 and df2. Columns are joined by name, regardless of their order (though the order of columns is preserved in the result, following the order in each DataFrame being joined, left to right).

Changing the semantics from inner to left gives us all the columns of the leftmost DataFrame, while positions in columns missing from the other concatenated DataFrame are filled with nulls:

DataFrame dfv = df1.vConcat(JoinType.left, df2);
b    a
---- -
1    2
3    4
null 5
null 7
4 rows x 2 columns

We leave it as an exercise for the reader to try the right and full join semantics.

hConcat works similarly, except that concatenation is done by row position, as there are no row labels:

DataFrame df1 = DataFrame.newFrame("a", "b").foldByRow(
        1, 2,
        3, 4,
        5, 6);

DataFrame df2 = DataFrame.newFrame("c", "d").foldByRow(
        7, 8,
        9, 10);

DataFrame dfv = df1.hConcat(JoinType.left, df2);
a b c    d
- - ---- ----
1 2 7    8
3 4 9    10
5 6 null null
3 rows x 4 columns

Joins

Grouping and Aggregating

Window functions

Working with .csv Files

CSV (comma-separated values) is a very common and rather simple format for working with raw data. CSV files can be manipulated programmatically or manually (via Excel/LibreOffice), loaded to / from databases, etc. DFLib supports reading DataFrames from CSV and storing them to CSV. You need to add an extra dependency to your project to take advantage of this functionality:

<dependency>
    <groupId>com.nhl.dflib</groupId>
    <artifactId>dflib-csv</artifactId>
</dependency>

Once you do that, the Csv class is the entry point to all the .csv related operations discussed below.

Reading DataFrame from CSV

The simplest API for reading a CSV is this:

DataFrame df = Csv.load("src/test/resources/f1.csv"); (1)
1 The argument to the "load" method can be a filename, a file, or a Reader, so data can be loaded from a variety of sources.

The result is a DataFrame that matches CSV structure:

A B
- --
1 s1
4 s2
2 rows x 2 columns

DFLib made a few assumptions about the data in f1.csv: that the first row contains column names, that all columns are Strings, and that all of them should be included. These assumptions do not hold for many data sets, so there is a longer form of this API that allows you to configure column types, skip rows and columns, and even sample the data without loading the entire CSV into memory. Some examples:

DataFrame df = Csv.loader() (1)
        .skipRows(1) (2)
        .header("x", "y") (3)
        .intColumn("x") (4)
        .load("src/test/resources/f1.csv");
1 Instead of the "load" method, use "loader".
2 Skip the header row.
3 Provide our own header.
4 Convert the first column to int.
x y
- --
1 s1
4 s2
2 rows x 2 columns
In theory you don’t have to do most of these tweaks via the CSV loader. You can load the raw data and then use standard DataFrame transformations to shape the result. However, doing it via the loader lets DFLib optimize both load speed and memory use, so it is usually preferable.
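The sampling mentioned above can be sketched as follows. Note that sampleRows is an assumption about the CsvLoader API (check the javadoc of your DFLib version for the exact method name and signature); the temp-file setup is only there to make the example self-contained:

```java
// create a small throwaway CSV file to sample from (stdlib file I/O)
Path csvFile = Files.createTempFile("f1", ".csv");
Files.write(csvFile, Arrays.asList("A,B", "1,s1", "4,s2", "2,s3"));

// "sampleRows" is assumed here: read a random sample of at most 2 rows
// without materializing the entire file in memory
DataFrame sample = Csv.loader()
        .sampleRows(2)
        .load(csvFile.toFile());
```
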

Writing DataFrame to a CSV

Writing to a CSV is equally easy:

Csv.save(df, "target/df.csv"); (1)
1 The argument to the "save" method can be a filename, a file, or a Writer (or any Appendable), so data can be stored to a variety of destinations.

Just like with the loader, CSV saver provides its own set of options, if the defaults are not sufficient:

Csv.saver() (1)
        .createMissingDirs() (2)
        .format(CSVFormat.EXCEL) (3)
        .save(df, "target/csv/df.csv");
1 Instead of the "save" method, use "saver" to be able to customize the process.
2 If intermediate directories for the output file are missing, create them.
3 Provide an alternative format (either from a collection of known formats, or user-defined).

Working with Relational Databases

Relational databases are arguably the most important type of data store out there. They also happen to map really well to DataFrames. DFLib provides advanced support for loading and saving DataFrames to an RDBMS. It supports transactions, auto-generation of certain SQL statements, merging of data on top of existing data ("create-or-update"), and building custom SQL for selects and updates. It knows how to handle different DB "flavors" and does a number of other cool database-related things.

To start using all this, you need to import dflib-jdbc module:

<dependency>
    <groupId>com.nhl.dflib</groupId>
    <artifactId>dflib-jdbc</artifactId>
</dependency>

<!-- Of course you also need to import your preferred JDBC driver -->

JdbcConnector

Once the imports are set up, use the Jdbc class to obtain an instance of JdbcConnector, which can later be used for all DB operations. You may already have a preconfigured javax.sql.DataSource, in which case the simplest way to create a connector is to pass that DataSource to the Jdbc factory method:

JdbcConnector connector = Jdbc.connector(dataSource);

Alternatively you can build everything from scratch:

JdbcConnector connector = Jdbc
        .connector("jdbc:derby:target/derby/mydb;create=true")
        // .driver("com.foo.MyDriver") (1)
        // .userName("root")
        // .password("secret") (2)
        .build();
1 (Optional) Driver name. Some drivers are not properly registered with the DriverManager, and require an explicit declaration.
2 DB account username/password (not needed for our in-memory Derby example, but will definitely be required for a real database).

TableLoader / TableSaver

Once you have the connector, you can start reading and persisting DataFrame data. Many DataFrames map directly to individual tables (or views) defined in the database. For those DFLib provides a set of rather straightforward operations that do not require the user to write SQL (all needed SQL is auto-generated by the framework):

DataFrame df = connector.tableLoader("person").load();
id name             salary
-- ---------------- --------
 1 Jerry Cosin      70000.0
 2 Amanda Gabrielly 85000.0
 3 Joan O'Hara      101000.0
3 rows x 3 columns

The table loader provides a way to customize the load operation. It allows you to select specific columns, set the maximum number of rows to read, sample rows, and even specify a fetch condition as another DataFrame. Some examples:

DataFrame condition = DataFrame.newFrame("salary")
        .addRow(70_000)
        .addRow(101_000)
        .create();

DataFrame df = connector.tableLoader("person")
        .includeColumns("name", "salary")
        .eq(condition)
        .load();
name        salary
----------- --------
Jerry Cosin 70000.0
Joan O'Hara 101000.0
2 rows x 2 columns

What it doesn’t require (unlike the CSV loader) is explicit column types, as the proper value types are inferred from database metadata.

The table saver allows you to save a DataFrame to a table. Column names in the DataFrame must match the column names in the DB table:

DataFrame df = DataFrame.newFrame("id", "name", "salary")
        .addRow(1, "Jerry Cosin", 70_000)
        .addRow(2, "Amanda Gabrielly", 85_000)
        .addRow(3, "Joan O'Hara", 101_000)
        .create();

connector.tableSaver("person").save(df);

In this scenario the table saver executes an insert for each DataFrame row. But what if there is already existing data in the table? The user has a few options to overwrite or merge the data:

  • Append the data (that’s what we effectively did above).

  • Delete all existing data before doing the insert.

  • Merge the data by comparing DataFrame and DB table data on PK column(s) or an arbitrary set of columns. Insert missing rows, update the existing rows. If the table has more columns than there are columns in the DataFrame, the data in those extra columns is preserved.

Delete before insert example:

connector.tableSaver("person")
        .deleteTableData()
        .save(df);

Merge by PK example (PK columns are detected from DB metadata):

connector.tableSaver("person")
        .mergeByPk()
        .save(df);

SqlLoader / SqlSaver

TODO

Working in Jupyter Notebook

While DFLib runs inside regular Java applications, it also works perfectly in a Jupyter notebook, a very common environment among data scientists that is usually associated with Python / pandas. Jupyter notebooks are very visual and allow you to debug each step of your data load and transformation pipeline individually, without having to rerun the whole thing.

Combining DFLib with Jupyter provides a pandas-like notebook experience, only in pure Java. Code developed in a notebook can later be copied and pasted into your application. Java developers should consider using Jupyter as part of their data project workflow; DFLib’s design (and the Jupyter integration features described here) makes this rather easy.

Installing Jupyter with Java Kernel

  • You will need at least Java 9 to run Java in Jupyter, as it relies on jshell. Java 11 or newer is recommended.

  • Install Jupyter, following the Jupyter installation instructions. It requires Python. Out of the variety of installation options, the authors of this document are using Mac OS, pyenv, Python 3.7, and pip.

  • Install the iJava Jupyter "kernel".

Work with DFLib in Jupyter

Start Jupyter

jupyter notebook

The browser opens. Click on "New > Java". In the document cell enter something like this:

%maven com.nhl.dflib:dflib:0.7
%maven com.nhl.dflib:dflib-jupyter:0.7

import com.nhl.dflib.jupyter.*;
import com.nhl.dflib.*;

DFLibJupyter.init(getKernelInstance());

Click "Shift + Return" to execute the cell. If the are no errors, you can start using DFLib API in the following cells. E.g.:

DataFrame df = DataFrame.newFrame("a", "b", "c").foldIntByColumn(-1, 1, 2, 3, 4, 5, 6);

// the result of the last statement in a cell is printed just below it
df

Change Display Parameters

To control truncation behavior, you can use static methods on DFLibJupyter:

DFLibJupyter.setMaxDisplayRows(10);
DFLibJupyter.setMaxDisplayColumnWidth(50);

Notebooks and Version Control

Jupyter places the data generated by a notebook in the notebook itself. This creates a challenge when keeping notebooks under version control. Assuming you are using Git, you will need to add Git hooks to your project that strip the data off before commit.

TODO: a recipe for excluding the data
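Until that recipe is written, one widely used approach (not DFLib-specific, mentioned here only as a suggestion) is the nbstripout tool, which registers a Git filter that removes notebook output before commit:

```shell
# install nbstripout (requires Python/pip, which Jupyter already needs)
pip install nbstripout

# register it as a Git filter for the current repository;
# notebook outputs are then stripped automatically on commit
nbstripout --install
```
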

Unit Testing