"DFLib" (short for "DataFrame Library") is a lightweight, pure Java implementation of DataFrame. DataFrame is a very common structure in the data science and Big Data worlds. It provides operations like search, filtering, joins, aggregations, statistical functions, etc., that are reminiscent of SQL (and also of Excel), except you run them in your app over dynamic in-memory data sets. There are DataFrame implementations in Python (pandas), R, Apache Spark, etc.
The DFLib project’s goal is to provide the same functionality for regular Java applications. It is a simple library that requires no special infrastructure. DFLib core is dependency-free.
The code in this documentation can be run in any Java IDE, as well as (and this is pretty cool!) in a Jupyter notebook. You will need Java 8 or newer. Running in Jupyter will require Java 9+.
Getting Started with DFLib
Include DFLib in a project. Assuming you are using Maven, start by declaring a "BOM" to have a common version for multiple DFLib modules:
<dependencyManagement>
<dependencies>
<dependency>
<groupId>com.nhl.dflib</groupId>
<artifactId>dflib-bom</artifactId>
<version>0.10</version>
<type>pom</type>
<scope>import</scope>
</dependency>
</dependencies>
</dependencyManagement>
Next include DFLib core as a dependency:
<dependency>
<groupId>com.nhl.dflib</groupId>
<artifactId>dflib</artifactId>
</dependency>
Create a DataFrame, do some manipulations with it, print the result:
DataFrame df1 = DataFrame
.newFrame("a", "b", "c")
.foldIntStreamByRow(IntStream.range(1, 10000));
DataFrame df2 = df1.selectRows(
df1.getColumnAsInt(0).indexInt(i -> i % 2 == 0));
System.out.println(Printers.tabular.toString(df2));
When you run this code, console output will look like this:
a    b    c
---- ---- ----
4    5    6
10   11   12
16   17   18
...
9982 9983 9984
9988 9989 9990
9994 9995 9996
1666 rows x 3 columns
We’ll omit print statements in all the following examples and will simply display their output. Details of printing are discussed in the "Printers" chapter.
Main Data Structures
The two essential DFLib classes are Series and DataFrame. Series is a 1-dimensional array-like object, and DataFrame is a 2-dimensional table-like object. Columns in a DataFrame are stored as Series. Additionally, we will discuss the Index object that stores DataFrame column names.
Both DataFrame and Series (and Index) are fully immutable, so all operations on them return a new instance. Behind the scenes the library shares as much data as possible between instances, so copying these objects does not cause significant performance degradation, and in turn makes DFLib fully thread-safe, with multiple concurrent operations possible on the same data structure. Also immutability means that you can have full snapshots of your data at each step of a transformation, simplifying debugging and auditing of data pipelines.
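As a quick illustration of this immutability, here is a minimal sketch (it uses Series.forData and sort(..), both shown in later chapters): the operation returns a new Series, and the original is left untouched.

```java
import java.util.Comparator;

import com.nhl.dflib.Series;

public class ImmutabilityDemo {
    public static void main(String[] args) {
        Series<String> original = Series.forData("c", "a", "b");

        // sort(..) returns a new Series; "original" is not modified
        Series<String> sorted = original.sort(Comparator.naturalOrder());

        System.out.println(original.get(0)); // still "c"
        System.out.println(sorted.get(0));   // "a"
    }
}
```

Both objects can now be used concurrently, since neither will ever change.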
Series
Series is the simplest of the two data structures. You can think of it as a wrapper around an array of values.
You can use Series to model sequences of data such as timestamps in a time series. The Series object is parameterized for the type of data that it holds, so there can be e.g. a Series<String> or a Series<LocalDate>. There is also an important category of "primitive" Series (IntSeries, LongSeries, DoubleSeries, BooleanSeries) that are optimized for memory use and arithmetic operations. They can also be used as Series of the corresponding wrapper object type (e.g. IntSeries is also a Series<Integer>), though this is less efficient.
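For example, an IntSeries can be passed around as a Series<Integer>, trading efficiency for generality (a small sketch using the factory methods shown below):

```java
import com.nhl.dflib.IntSeries;
import com.nhl.dflib.Series;

public class PrimitiveSeriesDemo {
    public static void main(String[] args) {
        IntSeries ints = IntSeries.forInts(1, 2, 3);

        // primitive access, no boxing
        int first = ints.getInt(0);

        // the same object viewed as a Series of the wrapper type;
        // each access now boxes the value
        Series<Integer> boxed = ints;
        Integer firstBoxed = boxed.get(0);

        System.out.println(first + " " + firstBoxed); // 1 1
    }
}
```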
The Series object is important as a building block of DataFrame, but it also defines a number of useful data manipulation and transformation operations of its own. Those will be covered in the following chapters. Here we’ll discuss how to create Series.
Creating Series from Arrays
Series can be created using the static forData method on the Series interface:
Series<String> s = Series.forData("a", "bcd", "ef", "g");
a
bcd
...
g
4 elements
Primitive Series classes have their own factory methods. E.g.:
IntSeries is = IntSeries.forInts(0, 1, -300, Integer.MAX_VALUE);
Creating Series Incrementally
If we don’t have a full array of data to start with, and instead have code producing a sequence of values of unpredictable length, we can use special mutable "accumulators" to store the sequence and then produce a Series out of it. E.g. the following example reads data from an InputStream line by line as Strings:
// InputStream inputStream = ...
Accumulator<String> accum = new ObjectAccumulator<>();
Scanner scanner = new Scanner(inputStream);
while (scanner.hasNext()) {
accum.add(scanner.next());
}
Series<String> s = accum.toSeries();
Similarly, there are IntAccumulator, LongAccumulator, etc.
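A primitive accumulator can be used the same way. The sketch below assumes IntAccumulator follows the same add(..) / toSeries() pattern as ObjectAccumulator above; the exact package and method names are an assumption to verify against your DFLib version.

```java
import com.nhl.dflib.IntSeries;
import com.nhl.dflib.accumulator.IntAccumulator;

public class IntAccumulatorDemo {
    public static void main(String[] args) {
        // assumed API: a primitive accumulator mirroring ObjectAccumulator
        IntAccumulator accum = new IntAccumulator();

        // collect a sequence of unknown length without boxing
        for (int i = 0; i < 5; i++) {
            accum.add(i * i);
        }

        IntSeries squares = accum.toSeries();
        System.out.println(squares.size());
    }
}
```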
DataFrame
DataFrame is an in-memory table with rows and columns. Each column has a name and is represented as a Series object. A DataFrame can contain columns of different kinds and is not parameterized for a single type.
There are a few ways to create a DataFrame. Here we’ll show the "builder" API that converts arrays, Streams, Collections, Series and other in-memory data into DataFrames.
Alternatively, a DataFrame can be loaded from (and saved to) a database or a CSV file. Those data sources are discussed in separate documentation chapters.
First example - adding data row by row:
DataFrame df = DataFrame
.newFrame("name", "age") (1)
.addRow("Joe", 18) (2)
.addRow("Andrus", 45)
.addRow("Joan", 32)
.create();
(1) Builder is created from column names.
(2) Individual rows are passed one by one.
The resulting DataFrame looks like this:
name   age
------ ---
Joe    18
Andrus 45
Joan   32
3 rows x 2 columns
Another example - if we have a single array and need to "fold" it into a two-dimensional DataFrame row-by-row:
DataFrame df = DataFrame
.newFrame("name", "age") (1)
.foldByRow("Joe", 18, "Andrus", 45, "Joan", 32); (2)
(1) Builder is created from column names.
(2) A varargs array of values is passed and folded to match the specified number of columns, row by row.
Same, but folding column-by-column:
DataFrame df = DataFrame
.newFrame("name", "age")
.foldByColumn("Joe", "Andrus", "Joan", 18, 45, 32);
You can also create DataFrames from Collections and Streams (folded either by row or by column). Here is an example of how to use a Stream of primitive ints, creating a DataFrame made of memory-efficient IntSeries columns:
DataFrame df = DataFrame
.newFrame("col1", "col2")
.foldIntStreamByColumn(IntStream.range(0, 10000));
col1 col2
---- ----
   0 5000
   1 5001
   2 5002
...
4997 9997
4998 9998
4999 9999
5000 rows x 2 columns
Finally, a DataFrame can be created from an array of Series, each Series representing a column:
DataFrame df = DataFrame
.newFrame("name", "age")
.columns(
Series.forData("Joe", "Andrus", "Joan"),
IntSeries.forInts(18, 45, 32)
);
This is the fastest and most efficient way to make a DataFrame, as an array of Series is how it is structured internally.
Now that we’ve discussed how to create Series and DataFrames, we’ll talk about printing their contents, and then will continue to the chapters that describe data operations.
Index
Index is somewhat similar to Series. It stores a sequence of String labels, corresponding to the column labels of the DataFrame it is attached to. While you normally won’t need to create or manipulate an Index outside of a DataFrame, you may use it to access DataFrame column labels.
DataFrame df = DataFrame
.newFrame("col1", "col2")
.foldIntStreamByColumn(IntStream.range(0, 10000));
String[] labels = df.getColumnsIndex().getLabels(); (1)
(1) Access the DataFrame index and read the label names.
Printers
When doing data exploration and running data pipelines, it is important to be able to visualize data at every step. More advanced forms of visualization include charts and diagrams, but the simplest thing you can do is print data to the console. Both DataFrame and Series implement the toString() method, which prints their contents as a single line, truncating large data sets in the process. This is good for debugging applications running on a server.
A more human-friendly form of output is done by using a tabular printer. Here is how to use the default tabular printer:
DataFrame df = DataFrame
.newFrame("col1", "col2", "col3")
.foldIntStreamByColumn(IntStream.range(0, 10000));
String table = Printers.tabular.toString(df);
System.out.println(table);
It prints the data as follows, displaying at most 6 rows and truncating the rest. The same goes for cell values: values longer than 30 characters are also truncated (since all values in our example are short, value truncation is not visible here).
col1 col2 col3
---- ---- ----
   0 3334 6668
   1 3335 6669
   2 3336 6670
...
3331 6665 9999
3332 6666    0
3333 6667    0
3334 rows x 3 columns
You can change the truncation parameters by creating your own printer:
Printer printer = new TabularPrinter(3, 3); (1)
String table = printer.toString(df);
System.out.println(table);
(1) Create a printer that displays at most 3 rows and at most 3 characters in each cell.
c.. c.. c..
--- --- ---
0   3.. 6..
1   3.. 6..
...
3.. 6.. 0
3334 rows x 3 columns
If you are using Jupyter Notebook, all the printers are already set up for you. So if the last statement in a Jupyter cell is a DataFrame or a Series, it will be printed as a table in the notebook.
Selecting Rows and Columns
Before analyzing or transforming a Series or a DataFrame, very often it needs to be narrowed down to just the relevant data. The API for selecting and filtering rows and columns described in this chapter helps with such tasks.
Selecting Columns
The selectColumns(..) method allows you to do a few things at once: select a subset of columns in the DataFrame, reorder columns, and even duplicate columns.
- Selecting by column name. In the example below we are getting rid of the "middle" column and reordering the "first" and "last" columns:
DataFrame df = DataFrame.newFrame("first", "last", "middle").foldByRow(
"Jerry", "Cosin", "M",
"Amanda", "Gabrielly", null,
"Joan", "O'Hara", "J");
DataFrame df1 = df.selectColumns("last", "first");
last      first
--------- ------
Cosin     Jerry
Gabrielly Amanda
O'Hara    Joan
3 rows x 2 columns
- Selecting by column position instead of name:
DataFrame df = DataFrame.newFrame("first", "last", "middle").foldByRow(
"Jerry", "Cosin", "M",
"Amanda", "Gabrielly", null,
"Joan", "O'Hara", "J");
DataFrame df1 = df.selectColumns(1, 0);
Deleting Columns
DataFrame.dropColumns(..) is an operation complementary to selectColumns. It allows you to remove one or more columns, but does not allow reordering of the remaining columns.
DataFrame df = DataFrame.newFrame("first", "last", "middle").foldByRow(
"Jerry", "Cosin", "M",
"Amanda", "Gabrielly", null,
"Joan", "O'Hara", "J");
DataFrame df1 = df.dropColumns("first", "middle"); (1)
(1) Drop one or more columns from the DataFrame.
last
---------
Cosin
Gabrielly
O'Hara
3 rows x 1 column
Selecting Rows
- Selecting a subset of rows by position. As there are no row labels in DFLib yet, selecting by position is the only "direct" selection option. Just like with column selection, it allows you to duplicate rows by referencing the same position more than once, as well as reorder rows:
DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
"Jerry", "Cosin",
"Amanda", "Gabrielly",
"Joan", "O'Hara");
DataFrame df1 = df.selectRows(2, 0, 0);
first last
----- ------
Joan  O'Hara
Jerry Cosin
Jerry Cosin
3 rows x 2 columns
- Selecting using an index. You may precalculate an "index" (an IntSeries) of row positions and then use it for row selection. An index often comes from applying a condition to a Series that is not a part of the DataFrame. E.g.:
DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
"Jerry", "Cosin",
"Amanda", "Gabrielly",
"Joan", "O'Hara");
IntSeries rowNumbers = new IntSequenceSeries(0, df.height());
IntSeries selectionIndex = rowNumbers.indexInt(i -> i % 3 == 0); (1)
DataFrame df1 = df.selectRows(selectionIndex);
(1) The index is calculated by applying a condition to a Series that is not a part of the DataFrame whose rows we are selecting.
first last
----- -----
Jerry Cosin
1 row x 2 columns
Filtering Rows
A very common form of row selection is "filtering", i.e. building a DataFrame with a subset of rows that match a conditional expression. Unlike selectRows, filterRows(..) methods do not require direct references to row positions. Also, filtering does not allow reordering or duplicating rows. The following flavors of filtering are available:
- filterRows(..) with a column condition:
DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
"Jerry", "Cosin",
"Amanda", "Gabrielly",
"Joan", "O'Hara");
DataFrame df1 = df.filterRows(
"first",
(String f) -> f != null && f.startsWith("J"));
first last
----- ------
Jerry Cosin
Joan  O'Hara
2 rows x 2 columns
- filterRows(..) with a full row condition:
DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
"Jerry", "Cosin",
"Amanda", "Gabrielly",
"Joan", "O'Hara");
DataFrame df1 = df.filterRows(r ->
r.get("first").toString().startsWith("J")
&& r.get("last").toString().startsWith("O"));
first last
----- ------
Joan  O'Hara
1 row x 2 columns
- filterRows(..) with a BooleanSeries "index". We’ve already seen how to select rows using an IntSeries index, which allowed us to perform the index calculation outside of the DataFrame whose rows we are filtering. A BooleanSeries "index" (or "mask") can similarly serve as a filter condition. Such an index is usually a result of a prior logical operation between Series. E.g.:
DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
"Jerry", "Cosin",
"Amanda", "Gabrielly",
"Joan", "O'Hara");
Series<String> names = Series.forData("Sandra", "Anton", "Joan");
BooleanSeries mask = names.eq(df.getColumn("first")); (1)
DataFrame df1 = df.filterRows(mask);
(1) The index is calculated by comparing a Series of names with the DataFrame column of first names.
first last
----- ------
Joan  O'Hara
1 row x 2 columns
Head and Tail
When you only need the "first N" or "last M" rows of a DataFrame (or values of a Series) you can use the head and tail operations. Here is how to get the top 2 rows of a DataFrame:
DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
"Jerry", "Cosin",
"Amanda", "Gabrielly",
"Joan", "O'Hara");
DataFrame df1 = df.head(2); (1)
(1) Returns a new DataFrame with the top two rows of the original DataFrame.
first  last
------ ---------
Jerry  Cosin
Amanda Gabrielly
2 rows x 2 columns
tail works similarly, but returns the last N rows:
DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
"Jerry", "Cosin",
"Amanda", "Gabrielly",
"Joan", "O'Hara");
DataFrame df1 = df.tail(1);
first last
----- ------
Joan  O'Hara
1 row x 2 columns
Series also defines head(..) and tail(..) operations that do what you’d expect, returning the first or last N values.
Unlike index operations on arrays and lists in Java, head(..) and tail(..) are safe with regard to bounds checking. They DO NOT throw exceptions when the length parameter is bigger than the DataFrame height (or Series size), returning an empty DataFrame (or Series) instead.
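A small sketch of this bounds behavior (no assumption is made here about the exact contents of the result, only that no exception is thrown):

```java
import com.nhl.dflib.DataFrame;

public class HeadTailBoundsDemo {
    public static void main(String[] args) {
        DataFrame df = DataFrame
                .newFrame("a")
                .foldByRow(1, 2, 3);

        // no IndexOutOfBoundsException here, despite df having only 3 rows
        DataFrame h = df.head(10);
        DataFrame t = df.tail(10);

        System.out.println(h.height() + " " + t.height());
    }
}
```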
Columns Manipulation
Selecting rows and columns, described in the previous chapter, helps to produce a desired subset of data. Here we’ll demonstrate a few extra methods for DataFrame column manipulation:
- Adding columns with new data, which may or may not be derived from the data in other DataFrame columns
- Renaming columns
- Normalizing column data and converting it to a specific Java type
Adding Columns
There are a few ways to add columns to a DataFrame:
- Add a column by calculating cell values based on data from other columns:
DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
"Jerry", "Cosin",
"Amanda", "Gabrielly",
"Joan", "O'Hara");
DataFrame df1 = df.addColumn(
"full", (1)
r -> r.get("first") + " " + r.get("last") (2)
);
(1) The new column name.
(2) An operation to produce values for the new column from the existing row. r is a special RowProxy object providing access to the row being processed.
first  last      full
------ --------- ----------------
Jerry  Cosin     Jerry Cosin
Amanda Gabrielly Amanda Gabrielly
Joan   O'Hara    Joan O'Hara
3 rows x 3 columns
- Add multiple columns in a single method call (generally faster than adding them individually):
DataFrame df1 = df.addColumns(
new String[]{"last_initial", "first_initial"},
r -> r.get("last").toString().charAt(0),
r -> r.get("first").toString().charAt(0)
);
first  last      last_initial first_initial
------ --------- ------------ -------------
Jerry  Cosin     C            J
Amanda Gabrielly G            A
Joan   O'Hara    O            J
3 rows x 4 columns
- Adding a Series. If you already have a Series object whose size matches the DataFrame height, it can be added as a column. E.g. we can add a column that will number the DataFrame rows:
IntSeries rowNumbers = new IntSequenceSeries(0, df.height()); (1)
DataFrame df1 = df.addColumn(
"number",
rowNumbers
);
(1) A Series that produces a sequence of numbers from 0 to df.height().
first  last      number
------ --------- ------
Jerry  Cosin     0
Amanda Gabrielly 1
Joan   O'Hara    2
3 rows x 3 columns
For this particular task (adding row numbers) there’s a shortcut that does the same thing as the code above:
DataFrame df1 = df.addRowNumber("number");
Renaming Columns
Often DataFrames come from external sources (like CSV files or database tables) with column names requiring further cleanup to match the expectations of the application. There are a number of ways to rename columns:
- Rename individual columns:
DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
"Jerry", "Cosin",
"Joan", "O'Hara");
DataFrame df1 = df
.renameColumn("first", "first_name")
.renameColumn("last", "last_name");
first_name last_name
---------- ---------
Jerry      Cosin
Joan       O'Hara
2 rows x 2 columns
- Specify the new names of all columns at once. This may not be practical for a large number of columns:
DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
"Jerry", "Cosin",
"Joan", "O'Hara");
DataFrame df1 = df.renameColumns("first_name", "last_name");
first_name last_name
---------- ---------
Jerry      Cosin
Joan       O'Hara
2 rows x 2 columns
- Rename using a map of old names to new names. Names not present in the map will remain unchanged:
Map<String, String> nameMap = new HashMap<>();
nameMap.put("first", "first_name");
DataFrame df = DataFrame.newFrame("first", "last").foldByRow(
"Jerry", "Cosin",
"Joan", "O'Hara");
DataFrame df1 = df.renameColumns(nameMap);
first_name last
---------- ------
Jerry      Cosin
Joan       O'Hara
2 rows x 2 columns
- Rename using a function applied to all column names in turn. This is useful e.g. to convert all names to lower case:
DataFrame df = DataFrame.newFrame("FIRST", "LAST").foldByRow(
"Jerry", "Cosin",
"Joan", "O'Hara");
DataFrame df1 = df.renameColumns(String::toLowerCase);
first last
----- ------
Jerry Cosin
Joan  O'Hara
2 rows x 2 columns
Changing Column Type
Data may come to us in various forms. Often it comes as Strings or binary objects, leaving it up to the application to interpret them. There are a few methods in DataFrame to convert column values from one type to another. The most generic is convertColumn, which can convert between any two data types:
DataFrame df = DataFrame.newFrame("year", "sales").foldByRow(
"2017", "2005365.01",
"2018", "4355098.75");
DataFrame df1 = df
.convertColumn(0, (String s) -> Integer.valueOf(s))
.convertColumn(1, (String s) -> new BigDecimal(s));
Additionally there are methods that can convert Strings and Numbers to primitive values:
DataFrame df1 = df
.toIntColumn("year", 0) (1)
.toDoubleColumn("sales", 0.); (1)
(1) The second method argument is the primitive value to use when the value being converted is null.
toIntColumn and other such methods are not just syntactic sugar. They convert DataFrame columns to primitive Series, which take significantly less memory and compute most operations faster than Object-based ones. You should always consider using them where possible.
In addition to the methods for numeric primitives, DataFrame has toBooleanColumn, toEnumFromNumColumn and toEnumFromStringColumn methods that serve a similar purpose, only for booleans and enums.
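For instance, a String column can be turned into an enum column. This sketch assumes toEnumFromStringColumn takes a column name and an enum class, and that the String values match the enum constant names; verify both assumptions against your DFLib version.

```java
import com.nhl.dflib.DataFrame;

public class EnumColumnDemo {

    enum Color { RED, GREEN }

    public static void main(String[] args) {
        DataFrame df = DataFrame.newFrame("name", "color").foldByRow(
                "apple", "RED",
                "leaf", "GREEN");

        // assumed signature: (column label, enum class);
        // String values are expected to match the enum constant names
        DataFrame df1 = df.toEnumFromStringColumn("color", Color.class);

        System.out.println(df1.getColumn("color").get(0));
    }
}
```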
Cleaning up Data
Data coming to DFLib may be of varying quality and hence may require cleanup before processing. We’ve already discussed the column type conversion operation, which allows normalizing data, among other things. Another typical case of data requiring cleanup is missing values. Such values are represented as regular Java nulls. There is a variety of reasons why data can be missing. A rather common one is outer joins and concatenation operations, described further down in this document.
In all of these cases you may have to guess or extrapolate the actual values based on some internal knowledge of the dataset. There are methods in both Series and DataFrame that allow replacing nulls with either fixed values or values from adjacent cells. First, let’s see how this works for Series…
Filling Nulls in Series
The simplest approach is to replace nulls with a constant value:
Series<String> withNulls = Series.forData("a", "b", null, null, "c");
Series<String> filled = withNulls.fillNulls("X");
a
b
X
X
c
5 elements
Another strategy is to use values adjacent to the cells with nulls:
Series<String> filled = withNulls.fillNullsForward(); (1)
(1) fillNullsForward takes the non-null value preceding the null cell and uses it to replace nulls. Similarly, there’s a fillNullsBackwards method that uses the value following the null cell.
a
b
b
b
c
5 elements
Finally, there’s a way to fill nulls using another Series object as a kind of "template" or "mask":
Series<String> mask = Series.forData("A", "B", "C", "D", "E", "F", "G");
Series<String> filled = withNulls.fillNullsFromSeries(mask);
a
b
C
D
c
5 elements
Notice that the "mask" Series can be longer (or shorter) than the Series where the nulls are replaced. Values between the two Series objects are aligned by position starting at zero.
Filling Nulls in DataFrame
Filling nulls in a DataFrame is done by column, so the API is very similar to the Series.fillNulls* methods, only with an additional String or integer argument specifying the column position.
DataFrame withNulls = DataFrame.newFrame("c1", "c2").foldByRow(
"1", "1",
null, null,
"2", "2");
DataFrame filled = withNulls
.fillNullsBackwards("c1")
.fillNullsForward("c2");
c1 c2
-- --
1  1
2  1
2  2
3 rows x 2 columns
Sorting
You can sort values in a Series, and sort rows in a DataFrame. As is the case everywhere else in DFLib, sorting does not alter the original object, and instead creates a new instance of either Series or DataFrame.
First let’s look at sorting Series…
Sorting Series
Series provides a sort method to sort its data using a custom Comparator:
Series<String> s = Series.forData("12", "1", "123")
.sort(Comparator.comparingInt(String::length));
The result is a new Series with sorted data:
1
12
123
3 elements
Another example shows how to sort a Series in the "natural" order (this requires the Series type to be compatible with java.lang.Comparable):
Series<String> s = Series.forData("c", "d", "a")
.sort(Comparator.naturalOrder());
a
c
d
3 elements
Series of primitives provide methods to sort values in the natural order without an explicit comparator:
LongSeries s = LongSeries.forLongs(Long.MAX_VALUE, 15L, 0L)
.sortLong();
0
15
9223372036854775807
3 elements
Additionally, IntSeries has an optimized method to sort ints with a custom IntComparator.
Next we’ll check how to sort a DataFrame.
Sorting Rows by Column
Rows in a DataFrame can be sorted by any single column, assuming that column contains values that implement java.lang.Comparable (i.e. Strings, numbers, etc.):
DataFrame df = DataFrame.newFrame("first", "last", "middle").foldByRow(
"Jerry", "Cosin", "M",
"Amanda", "Gabrielly", null,
"Joan", "O'Hara", "J");
DataFrame df1 = df.sort("first", true); (1)
(1) The first argument is the column name (it can also be the column’s integer index); the second is a boolean indicating sort direction (true means ascending, false means descending order).
first  last      middle
------ --------- ------
Amanda Gabrielly null
Jerry  Cosin     M
Joan   O'Hara    J
3 rows x 3 columns
Sort column can either be specified by name, as in the example above, or by position:
DataFrame df1 = df.sort(0, true);
Sorting Rows by Multiple Columns
To specify additional sort columns, you can provide an array of column names (or column positions) and an array of sort direction specs:
DataFrame df = DataFrame.newFrame("first", "last", "middle").foldByRow(
"John", "Cosin", "M",
"Amanda", "Gabrielly", null,
"Joan", "Cosin", "J");
DataFrame df1 = df.sort(new String[]{"last", "first"}, new boolean[]{true, false});
first  last      middle
------ --------- ------
John   Cosin     M
Joan   Cosin     J
Amanda Gabrielly null
3 rows x 3 columns
Sorting Rows with Custom Algorithm
If it is not possible to sort by values of a specific column or columns (e.g. when they are not Comparable, or we need to use a non-standard ordering algorithm), you can provide your own function that generates an arbitrary Comparable value for each row in the DataFrame:
DataFrame df = DataFrame.newFrame("first", "last", "middle").foldByRow(
"Jerry", "Cosin", "M",
"Amanda", "Gabrielly", null,
"Joan", "O'Hara", "J");
DataFrame df1 = df.sort(r -> r.get(0).toString().length()); (1)
(1) Sorting by the length of the first name String.
first  last      middle
------ --------- ------
Joan   O'Hara    J
Jerry  Cosin     M
Amanda Gabrielly null
3 rows x 3 columns
Concatenating
There are various ways to combine data from multiple Series or DataFrames. The simplest is concatenation, described in this chapter. Series can be concatenated together, producing a longer Series. DataFrames can be concatenated either vertically (along the rows axis) or horizontally (along the columns axis).
Concatenating Series
If you have a Series object and want to concatenate it with one or more other Series, you’d use the Series.concat(..) method:
Series<String> s1 = Series.forData("x", "y", "z");
Series<String> s2 = Series.forData("a");
Series<String> s3 = Series.forData("m", "n");
Series<String> sConcat = s1.concat(s2, s3);
The result, as expected, contains the values of all Series put together:
x
y
z
a
m
n
6 elements
If you have a collection or an array of Series and want to "glue" them together, you can use the static SeriesConcat.concat(..):
Collection<Series<String>> ss = asList(
Series.forData("x", "y", "z"),
Series.forData("a"),
Series.forData("m", "n"));
Series<String> sConcat = SeriesConcat.concat(ss);
The result is the same as in the previous example. You can also concatenate a Series with itself:
Series<String> s = Series.forData("x", "y");
Series<String> sConcat = s.concat(s);
x
y
x
y
4 elements
Concatenating DataFrames
DataFrame offers two options for concatenation: vertical (stacking DataFrames on top of each other) and horizontal (putting them next to each other). Let’s see some examples:
DataFrame df1 = DataFrame.newFrame("a", "b").foldByRow(
1, 2,
3, 4);
DataFrame df2 = DataFrame.newFrame("a", "b").foldByRow(
5, 6,
7, 8);
DataFrame dfv = df1.vConcat(df2); (1)
(1) Concatenate this and another DataFrame vertically. The argument is a vararg, so more than one DataFrame can be passed to the method.
a b
- -
1 2
3 4
5 6
7 8
4 rows x 2 columns
DataFrame dfh = df1.hConcat(df2); (1)
(1) Concatenate this and another DataFrame horizontally. The argument is a vararg, so more than one DataFrame can be passed to the method.
a b a_ b_
- - -- --
1 2 5  6
3 4 7  8
2 rows x 4 columns
Since both df1 and df2 have the same column names, and a DataFrame must have unique column names, the _ suffix was automatically appended to the conflicting columns in the resulting DataFrame. We will see this auto-renaming behavior in other places, such as joins.
So far all our concatenation examples consisted of DataFrames that had matching dimensions (the same number and names of columns for vConcat, and the same number of rows for hConcat). But what if the concatenated DataFrames are shaped or named differently?
Concat methods may take an extra "how" parameter that defines the concatenation semantics. The type of the "how" parameter is JoinType, and it can be one of inner (the default), left, right, or full.
We’ll see JoinType again soon when discussing joins. Concatenation and joins are related fairly closely.
Let’s look at how this works with vConcat:
DataFrame df1 = DataFrame.newFrame("b", "a").foldByRow(
1, 2,
3, 4);
DataFrame df2 = DataFrame.newFrame("a", "c").foldByRow( (1)
5, 6,
7, 8);
DataFrame dfv = df1.vConcat(JoinType.inner, df2); (2)
(1) df1 column names are "b" and "a", while df2’s are "a" and "c".
(2) Explicitly passing JoinType.inner to vConcat to demonstrate the point. Since inner is the default, omitting it will not change the outcome.
a
-
2
4
5
7
4 rows x 1 column
As you can see in the result, the behavior of the inner join is to only keep the columns that are present in both df1 and df2. Columns are joined by name, regardless of their order (though the order of columns is preserved in the result, following the order in each DataFrame being joined, left to right).
Changing the semantics from inner to left gives us all the columns of the leftmost DataFrame; the columns that are missing from the concatenated DataFrame are filled with nulls in the corresponding positions:
DataFrame dfv = df1.vConcat(JoinType.left, df2);
b    a
---- -
1    2
3    4
null 5
null 7
4 rows x 2 columns
We leave it as an exercise for the reader to try the right and full joins.
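As a hint for the full case: reasoning from the inner and left semantics above, full should keep the union of columns from both DataFrames, with nulls filling the positions where a DataFrame had no such column. A sketch (the expected shape is an inference, not output we have verified):

```java
import com.nhl.dflib.DataFrame;
import com.nhl.dflib.join.JoinType;

public class FullConcatDemo {
    public static void main(String[] args) {
        DataFrame df1 = DataFrame.newFrame("b", "a").foldByRow(
                1, 2,
                3, 4);
        DataFrame df2 = DataFrame.newFrame("a", "c").foldByRow(
                5, 6,
                7, 8);

        // expected shape: columns "b", "a", "c" and 4 rows, with nulls
        // in "c" for df1's rows and in "b" for df2's rows
        DataFrame dfv = df1.vConcat(JoinType.full, df2);

        System.out.println(dfv.width() + " x " + dfv.height());
    }
}
```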
hConcat works similarly, except the concatenation is done by row position, as there are no row names:
DataFrame df1 = DataFrame.newFrame("a", "b").foldByRow(
1, 2,
3, 4,
5, 6);
DataFrame df2 = DataFrame.newFrame("c", "d").foldByRow(
7, 8,
9, 10);
DataFrame dfv = df1.hConcat(JoinType.left, df2);
a b c    d
- - ---- ----
1 2 7    8
3 4 9    10
5 6 null null
3 rows x 4 columns
Joins
Grouping and Aggregating
Window functions
Working with .csv Files
CSV (comma-separated values) is a very common and rather simple format for working with raw data. CSV files can be manipulated programmatically or manually (via Excel/LibreOffice), loaded to / from databases, etc. DFLib supports reading DataFrames from CSV and storing them to CSV. You need to add an extra dependency to your project to take advantage of this functionality:
<dependency>
<groupId>com.nhl.dflib</groupId>
<artifactId>dflib-csv</artifactId>
</dependency>
Once you do that, the Csv class is the entry point to all the .csv related operations discussed below.
Reading DataFrame from CSV
The simplest API for reading a CSV is this:
DataFrame df = Csv.load("src/test/resources/f1.csv"); (1)
(1) The argument to the "load" method can be a filename, a file, or a Reader, so data can be loaded from a variety of sources.
The result is a DataFrame that matches CSV structure:
A B
- --
1 s1
4 s2
2 rows x 2 columns
DFLib made a few assumptions about the data in f1.csv, namely that the first row represents column names, that all columns are Strings, and that all of them need to be included. These assumptions are not necessarily true for many data sets. So there is a longer form of this API that allows you to configure column types, skip rows and columns, or even sample the data without loading the entire CSV into memory. Some examples:
DataFrame df = Csv.loader() (1)
.skipRows(1) (2)
.header("x", "y") (3)
.intColumn("x") (4)
.load("src/test/resources/f1.csv");
(1) Instead of the "load" method, use "loader".
(2) Skip the header row.
(3) Provide our own header.
(4) Convert the first column to int.
x y
- --
1 s1
4 s2
2 rows x 2 columns
In theory you don’t have to do most of these tweaks via the CSV loader: you can load the raw data, and then use standard DataFrame transformations to shape the result. However, doing it via the loader allows DFLib to optimize both load speed and memory use, so it is usually preferable.
Writing DataFrame to a CSV
Writing to a CSV is equally easy:
Csv.save(df, "target/df.csv"); (1)
(1) The argument to the "save" method can be a filename, a file, or a Writer (or generally an Appendable), so data can be stored to a variety of destinations.
Just like with the loader, CSV saver provides its own set of options, if the defaults are not sufficient:
Csv.saver() (1)
.createMissingDirs() (2)
.format(CSVFormat.EXCEL) (3)
.save(df, "target/csv/df.csv");
1 | Instead of "save" method use "saver" to be able to customize the process. |
2 | If intermediate directories for the output file are missing, create them. |
3 | Provide an alternative format (either from a collection of known formats, or user-defined). |
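Since the "save" method also accepts a Writer (per the callout above), you can capture CSV output in memory rather than writing a file. A sketch, assuming only the Writer overload already mentioned:

```java
import java.io.StringWriter;

import com.nhl.dflib.DataFrame;
import com.nhl.dflib.csv.Csv;

public class CsvToWriterExample {

    public static void main(String[] args) {
        DataFrame df = DataFrame.newFrame("A", "B")
                .addRow(1, "s1")
                .addRow(4, "s2")
                .create();

        // write CSV into a StringWriter instead of a file
        StringWriter out = new StringWriter();
        Csv.save(df, out);

        // the first line is the header row
        System.out.println(out);
    }
}
```

This is useful when the CSV payload needs to go into an HTTP response or a message queue instead of the filesystem.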
Working with .avro Files
The Avro binary file format is common in data engineering. Its main advantage is that files are saved with an embedded schema, so when they are read back, the fields are converted to their correct original Java types. (This is different from, say, CSV: reading a CSV into a DataFrame produces only String columns by default.) To save or load DataFrames using the Avro format, include the following dependency:
<dependency>
<groupId>com.nhl.dflib</groupId>
<artifactId>dflib-avro</artifactId>
</dependency>
Once you do that, the Avro class is the entry point to all the operations. With it you can save DataFrames to .avro files and load them back.
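A save/load round trip might look like the sketch below. The Avro.save(..) / Avro.load(..) method names are assumed here by analogy with the Csv API, so verify them against the Avro class itself:

```java
import com.nhl.dflib.DataFrame;
import com.nhl.dflib.avro.Avro;

public class AvroRoundTripExample {

    public static void main(String[] args) {
        DataFrame df = DataFrame.newFrame("id", "name")
                .addRow(1, "Jerry Cosin")
                .addRow(2, "Joan O'Hara")
                .create();

        // save with an embedded schema...
        Avro.save(df, "target/df.avro");

        // ...then load it back; unlike with CSV, the "id" column
        // comes back as a numeric type, not a String
        DataFrame loaded = Avro.load("target/df.avro");
        System.out.println(loaded.height() + " rows x " + loaded.width() + " columns");
    }
}
```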
Avro Schema
In most cases you don’t need to know anything about Avro schemas. DFLib automatically generates a schema for a DataFrame before it is saved. Alternatively, you can create a custom Schema object based on the Avro specification and pass it to AvroSaver.
TODO…
Working with Relational Databases
Relational databases are arguably the most important type of data store out there. They also happen to map really well to DataFrames. DFLib provides advanced support for loading and saving DataFrames to an RDBMS. It supports transactions, auto-generation of certain SQL statements, merging of data on top of existing data ("create-or-update"), and building custom SQL for selects and updates. It knows how to handle different DB "flavors" and does a number of other useful database-related things.
To start using all this, you need to import the dflib-jdbc module:
<dependency>
<groupId>com.nhl.dflib</groupId>
<artifactId>dflib-jdbc</artifactId>
</dependency>
<!-- Of course you also need to import your preferred JDBC driver -->
JdbcConnector
Once the imports are set up, use the Jdbc class to obtain an instance of JdbcConnector that can later be used for all DB operations. You may already have a preconfigured javax.sql.DataSource, so the simplest way to create a connector is to pass that DataSource to the Jdbc factory method:
JdbcConnector connector = Jdbc.connector(dataSource);
Alternatively you can build everything from scratch:
JdbcConnector connector = Jdbc
.connector("jdbc:derby:target/derby/mydb;create=true")
// .driver("com.foo.MyDriver") (1)
// .userName("root")
// .password("secret") (2)
.build();
1 | (Optional) Driver name. Some drivers are not properly registered with the DriverManager, and require an explicit declaration. |
2 | DB account username/password (not needed for our in-memory Derby example, but will definitely be required for a real database). |
TableLoader / TableSaver
Once you have the connector, you can start reading and persisting DataFrame data. Many DataFrames map directly to individual tables (or views) defined in the database. For those, DFLib provides a set of rather straightforward operations that do not require the user to write SQL (all the needed SQL is auto-generated by the framework):
DataFrame df = connector.tableLoader("person").load();
id name             salary
-- ---------------- --------
 1 Jerry Cosin       70000.0
 2 Amanda Gabrielly  85000.0
 3 Joan O'Hara      101000.0
3 rows x 3 columns
The table loader provides a way to customize the load operation. It allows you to select specific columns, set the maximum number of rows to read, sample rows, and even specify a fetch condition as another DataFrame. Some examples:
DataFrame condition = DataFrame.newFrame("salary")
.addRow(70_000)
.addRow(101_000)
.create();
DataFrame df = connector.tableLoader("person")
.includeColumns("name", "salary")
.eq(condition)
.load();
name        salary
----------- --------
Jerry Cosin  70000.0
Joan O'Hara 101000.0
2 rows x 2 columns
Unlike the CSV loader, it does not require explicit column types, as the proper value types are inferred from the database metadata.
The table saver allows you to save a DataFrame to a table. Column names in the DataFrame should match the column names in the DB table:
DataFrame df = DataFrame.newFrame("id", "name", "salary")
.addRow(1, "Jerry Cosin", 70_000)
.addRow(2, "Amanda Gabrielly", 85_000)
.addRow(3, "Joan O'Hara", 101_000)
.create();
connector.tableSaver("person").save(df);
In this scenario the table saver executes an insert for each DataFrame row. But what if there is already existing data in the table? The user has a few options to overwrite or merge the data:
- Append the data (that’s what we effectively did above).
- Delete all existing data before doing the insert.
- Merge the data by comparing DataFrame and DB table data on PK column(s) or an arbitrary set of columns. Insert missing rows, update the existing rows. If the table has more columns than there are columns in the DataFrame, the data in those extra columns is preserved.
Delete before insert example:
connector.tableSaver("person")
.deleteTableData()
.save(df);
Merge by PK example (PK columns are detected from DB metadata):
connector.tableSaver("person")
.mergeByPk()
.save(df);
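Per the option list above, the merge can also be driven by an arbitrary set of columns rather than the PK. The mergeByColumns(..) method name in this sketch is an assumption, so verify it against the TableSaver API (the connector and df are the same objects as in the examples above; this fragment needs a live database to actually run):

```java
// hypothetical: merge on the "name" column instead of the PK;
// rows with a matching "name" are updated, others are inserted
connector.tableSaver("person")
        .mergeByColumns("name")
        .save(df);
```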
SqlLoader / SqlSaver
TODO
Working in Jupyter Notebook
While DFLib runs inside regular Java applications, it also works perfectly in a Jupyter notebook, a very common environment among data scientists that is usually associated with Python / pandas. Jupyter Notebook is very visual and allows you to debug each step of your data load and transformation pipeline individually, without needing to rerun the whole thing.
Combining DFLib with Jupyter provides a pandas-like notebook experience, only in pure Java. The code developed in a notebook can later be copied and pasted into your application. Java developers should consider using Jupyter as part of their data project workflow, and DFLib’s design (and its Jupyter integration features described here) makes it rather easy.
Installing Jupyter with Java Kernel
- You will need at least Java 9 to run Java in Jupyter, as it requires jshell. Java 11 or later is recommended.
- Install Jupyter. Follow the Jupyter installation instructions. It requires Python. Out of the variety of installation options, the authors of this document are using Mac OS, pyenv, Python 3.7, and pip.
- Install the iJava Jupyter "kernel".
Work with DFLib in Jupyter
Start Jupyter
jupyter notebook
The browser opens. Click on "New > Java". In the document cell enter something like this:
%maven com.nhl.dflib:dflib:0.10
%maven com.nhl.dflib:dflib-jupyter:0.10
import com.nhl.dflib.jupyter.*;
import com.nhl.dflib.*;
DFLibJupyter.init(getKernelInstance());
Click "Shift + Return" to execute the cell. If the are no errors, you can start using DFLib API in the following cells. E.g.:
DataFrame df = DataFrame.newFrame("a", "b", "c").foldIntByColumn(-1, 1, 2, 3, 4, 5, 6);
// the result of the last statement in a cell is printed just below it
df
Change Display Parameters
To control truncation behavior, you can use static methods on DFLibJupyter:
DFLibJupyter.setMaxDisplayRows(10);
DFLibJupyter.setMaxDisplayColumnWidth(50);
Notebooks and Version Control
Jupyter stores the data generated by the notebook in the notebook file itself. This creates a challenge for maintaining notebooks under version control. Assuming you are using Git, you will need to add Git hooks to your project to strip the data before commit.
TODO: a recipe for excluding the data
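One common recipe (not DFLib-specific; it assumes the third-party nbstripout tool, installed via pip) is to register a Git filter that strips notebook cell outputs before they are committed:

```shell
# install the nbstripout tool (requires Python / pip)
pip install nbstripout

# run from the repository root: registers nbstripout as a Git filter
# for *.ipynb files, so cell outputs are stripped on commit
nbstripout --install
```

After this, the notebooks in the repository stay free of generated data, while your local working copies keep their outputs.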